Data Models
indx’s core domain types are Pydantic v2 models that travel through the pipeline, serialize into index.json, and seal into a portable .indx archive. This page documents every field of every model. For a gentler, concept-level tour of how these objects fit together, see Core objects.
Conventions
A few rules apply across all models:

- Identifiers are zero-padded, stable strings. Chunks are `chunk_0481`, documents are `doc_0007`. Ids are assigned in deterministic traversal order (folder lineage, then path, then in-document index), so re-running over unchanged input yields identical ids.
- Vectors are `list[float]`: 32-bit floats in the on-disk matrix, surfaced as plain Python floats in memory.
- Free-form metadata is typed `dict[str, Any]` and serialized with sorted keys for diff-friendly output.
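The id and metadata conventions above can be illustrated with a small sketch. The helper `make_id`, the 4-digit width, the zero-based numbering, and the sample paths are assumptions made for illustration; they are not taken from indx itself.

```python
import json

def make_id(prefix: str, ordinal: int, width: int = 4) -> str:
    """Return a zero-padded id such as 'chunk_0481' (a width of 4 is assumed)."""
    return f"{prefix}_{ordinal:0{width}d}"

# Hypothetical discovery order; each entry is (folder lineage, path).
discovered = [
    (("legal",), "legal/gdpr.md"),
    ((), "readme.md"),
    (("legal",), "legal/dpa.md"),
]

# Sorting by (lineage, path) makes the numbering independent of discovery order.
doc_ids = {path: make_id("doc", i) for i, (_, path) in enumerate(sorted(discovered))}

# Sorted keys keep serialized metadata byte-identical across runs.
metadata_json = json.dumps({"topics": ["privacy"], "author": "legal-team"}, sort_keys=True)
```

Because the ordering key is derived only from the input tree, shuffling `discovered` would not change any assigned id.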
RelationType
Typed graph edges between documents and/or chunks. `RelationType` is a `str` enum, so its members serialize to their string values.

```python
class RelationType(str, Enum):
    SIBLING = "sibling"            # same folder / same logical group
    PARENT = "parent"              # folder lineage / containment
    REFERENCES = "references"      # outgoing citation, link, or mention
    CONTINUES = "continues"        # next unit in a split sequence
    DUPLICATE_OF = "duplicate-of"  # near/exact duplicate content
```

| Value | Meaning |
|---|---|
| `sibling` | Same folder or same logical group. |
| `parent` | Folder lineage / containment. |
| `references` | Outgoing citation, link, or mention. |
| `continues` | Next unit in a split sequence. |
| `duplicate-of` | Near or exact duplicate content. |
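As a quick check of what "serialize to their string values" means in practice, the sketch below repeats the enum definition so it runs on its own and uses only the standard library. The sample edge dictionary is illustrative.

```python
import json
from enum import Enum

class RelationType(str, Enum):
    SIBLING = "sibling"
    PARENT = "parent"
    REFERENCES = "references"
    CONTINUES = "continues"
    DUPLICATE_OF = "duplicate-of"

# A str-mixin enum member is encoded by json as its plain string value.
edge = {"type": RelationType.DUPLICATE_OF, "to": "chunk_0482"}
encoded = json.dumps(edge)

# Looking a member up by value restores it when reading the data back.
restored = RelationType(json.loads(encoded)["type"])
```

Note that the hyphenated value `duplicate-of` differs from the member name `DUPLICATE_OF`, so lookups from serialized data should go by value, not by name.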
Source
Provenance of a chunk or parsed unit: where the content came from in the walked tree.

| Field | Type | Description |
|---|---|---|
| `path` | `str` | Original file path, relative to the walked root. Required. |
| `folder` | `str` | Containing folder, relative to root. Required. |
| `type` | `str` | Detected or enriched document type, e.g. `"policy"`. Required. |
```python
class Source(BaseModel):
    path: str
    folder: str
    type: str
```

Relation
A typed, directed edge in the knowledge graph. A `Relation` may connect chunk→document, document→document, or chunk→chunk. When the edge is stored on its owning object (the owner is the implicit source), `from_id` is omitted.

| Field | Type | Default | Description |
|---|---|---|---|
| `type` | `RelationType` | required | The kind of edge. |
| `to` | `str` | required | Target id or path, e.g. `"legal/gdpr.md"` or `"chunk_0482"`. |
| `from_id` | `str \| None` | `None` | Source id; implicit (omitted) when stored on the source object. |
| `weight` | `float \| None` | `None` | Confidence / similarity score in [0, 1] where applicable. |
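The implicit-source rule can be sketched as follows. The dataclass is a standard-library stand-in that mirrors the documented fields (the real model is Pydantic), and `resolve_edge` is a hypothetical helper, not part of indx.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Relation:
    """Stdlib stand-in with the documented fields; the real model is Pydantic."""
    type: str
    to: str
    from_id: Optional[str] = None
    weight: Optional[float] = None

def resolve_edge(owner_id: str, rel: Relation) -> tuple[str, str, str]:
    """Return (source, type, target), falling back to the owner when from_id is omitted."""
    source = rel.from_id if rel.from_id is not None else owner_id
    return (source, rel.type, rel.to)

# Stored on a chunk: the owning chunk is the implicit source.
stored_on_chunk = Relation(type="continues", to="chunk_0482")
# Stored at graph level: the source must be explicit.
graph_level = Relation(type="references", to="legal/gdpr.md", from_id="doc_0007", weight=0.9)
```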
```python
class Relation(BaseModel):
    type: RelationType
    to: str
    from_id: Optional[str] = None
    weight: Optional[float] = None
```

ParsedDoc
The raw output of a `Parser` for a single source file, produced in stage 02 Parse before chunking. It carries normalized text plus structured artifacts that the 03 Chunk stage uses to split with structure intact. Returned by `Parser.parse(file) -> ParsedDoc`.

| Field | Type | Default | Description |
|---|---|---|---|
| `source` | `Source` | required | Provenance of this parsed unit. |
| `text` | `str` | required | Normalized full-text rendering. |
| `blocks` | `list[dict[str, Any]]` | `[]` | Structural blocks: headings, paragraphs, list items. |
| `tables` | `list[dict[str, Any]]` | `[]` | Extracted tables (rows / cells / markdown). |
| `images` | `list[dict[str, Any]]` | `[]` | Image refs and captions for VLM enrichment. |
| `metadata` | `dict[str, Any]` | `{}` | Parser-supplied raw metadata (title, author, page count). |
```python
class ParsedDoc(BaseModel):
    source: Source
    text: str
    blocks: list[dict[str, Any]] = []
    tables: list[dict[str, Any]] = []
    images: list[dict[str, Any]] = []
    metadata: dict[str, Any] = {}
```

Chunk
A retrievable unit of content. A `Chunk` remembers its source document, its position within that document, the ids of its immediate neighbor chunks, and any typed relations. The embedding vector is populated in stage 06 Embed+Pack and is typically not inlined into `index.json`; it lives in `embeddings/`.
| Field | Type | Default | Description |
|---|---|---|---|
| `id` | `str` | required | Stable id, e.g. `"chunk_0481"`. |
| `text` | `str` | required | The retrievable text payload. |
| `source` | `Source` | required | Originating document provenance. |
| `doc_id` | `str` | required | Id of the parent `Document`. |
| `index` | `int` | required | 0-based position within the parent document. |
| `metadata` | `dict[str, Any]` | `{}` | Enriched values: topics, summary, tags. |
| `neighbors` | `list[str]` | `[]` | Adjacent chunk ids (previous, next). |
| `relations` | `list[Relation]` | `[]` | Outgoing typed edges from this chunk. |
| `embedding` | `list[float] \| None` | `None` | Vector, populated in stage 06. May be omitted from `index.json`. |
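One way the `neighbors` field could be derived from a document's ordered chunk ids is sketched below. The helper `neighbor_ids` is hypothetical, and the assumption that the first and last chunks simply have one neighbor each is not stated in the source.

```python
def neighbor_ids(chunk_ids: list[str]) -> dict[str, list[str]]:
    """Map each chunk id to its adjacent ids (previous, next) within one document."""
    result: dict[str, list[str]] = {}
    for i, cid in enumerate(chunk_ids):
        adjacent = []
        if i > 0:
            adjacent.append(chunk_ids[i - 1])      # previous chunk
        if i + 1 < len(chunk_ids):
            adjacent.append(chunk_ids[i + 1])      # next chunk
        result[cid] = adjacent
    return result

neighbors = neighbor_ids(["chunk_0480", "chunk_0481", "chunk_0482"])
```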
```python
class Chunk(BaseModel):
    id: str
    text: str
    source: Source
    doc_id: str
    index: int
    metadata: dict[str, Any] = {}
    neighbors: list[str] = []
    relations: list[Relation] = []
    embedding: Optional[list[float]] = None
```

Document
One source file, enriched. A `Document` holds folder lineage, detected type, and the LLM-derived topics / tags / summary from stage 05 Enrich, plus the resolved set of outgoing and incoming references from stage 04 Relate.

| Field | Type | Default | Description |
|---|---|---|---|
| `id` | `str` | required | Stable id, e.g. `"doc_0007"`. |
| `path` | `str` | required | Original path relative to root. |
| `folder` | `str` | required | Containing folder (lineage segment). |
| `lineage` | `list[str]` | `[]` | Folder ancestry, root→leaf. |
| `type` | `str` | required | Detected / enriched document type. |
| `topics` | `list[str]` | `[]` | Enrichment-derived topics. |
| `tags` | `list[str]` | `[]` | Enrichment-derived tags. |
| `summary` | `str \| None` | `None` | Enrichment-derived summary. |
| `chunk_ids` | `list[str]` | `[]` | Chunks produced from this document, in order. |
| `references` | `list[Relation]` | `[]` | Outgoing references resolved in stage 04. |
| `referenced_by` | `list[Relation]` | `[]` | Incoming references (reverse edges). |
| `metadata` | `dict[str, Any]` | `{}` | Free-form additional metadata. |
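The relationship between `references` and `referenced_by` is a plain edge reversal. The sketch below shows the idea on bare id lists; `reverse_edges` and the sample ids are illustrative assumptions, and the real stage 04 logic may differ.

```python
from collections import defaultdict

def reverse_edges(references: dict[str, list[str]]) -> dict[str, list[str]]:
    """Given outgoing targets per document id, collect incoming sources per target."""
    incoming: dict[str, list[str]] = defaultdict(list)
    for source in sorted(references):          # sorted for deterministic output
        for target in references[source]:
            incoming[target].append(source)
    return dict(incoming)

outgoing = {
    "doc_0007": ["doc_0002"],
    "doc_0003": ["doc_0002", "doc_0007"],
}
referenced_by = reverse_edges(outgoing)
```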
```python
class Document(BaseModel):
    id: str
    path: str
    folder: str
    lineage: list[str] = []
    type: str
    topics: list[str] = []
    tags: list[str] = []
    summary: Optional[str] = None
    chunk_ids: list[str] = []
    references: list[Relation] = []
    referenced_by: list[Relation] = []
    metadata: dict[str, Any] = {}
```

SpaceStats
Aggregate counts surfaced via `space.stats`. The same shape appears under the `stats` key of `index.json` and is what `indx inspect --json` emits.

| Field | Type | Default | Description |
|---|---|---|---|
| `documents` | `int` | required | Number of documents. |
| `chunks` | `int` | required | Number of chunks. |
| `relations` | `int` | required | Number of relations. |
| `embeddings` | `int` | required | Number of stored vectors. |
| `embed_dim` | `int \| None` | `None` | Vector dimensionality, e.g. 1024 for bge-m3. |
| `types` | `dict[str, int]` | `{}` | Document count per detected type. |
| `bytes_source` | `int` | `0` | Total bytes of source material walked. |
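To make the `types` and count fields concrete, here is a rough sketch of how such aggregates could be computed from document records. The plain dictionaries and sample values are stand-ins; how indx actually derives its stats is not described here.

```python
from collections import Counter

# Hypothetical document records carrying only the fields needed for counting.
documents = [
    {"id": "doc_0001", "type": "policy", "chunk_ids": ["chunk_0001", "chunk_0002"]},
    {"id": "doc_0002", "type": "contract", "chunk_ids": ["chunk_0003"]},
    {"id": "doc_0003", "type": "policy", "chunk_ids": ["chunk_0004"]},
]

type_counts = Counter(doc["type"] for doc in documents)
stats = {
    "documents": len(documents),
    "chunks": sum(len(doc["chunk_ids"]) for doc in documents),
    "types": dict(sorted(type_counts.items())),  # sorted for stable output
}
```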
```python
class SpaceStats(BaseModel):
    documents: int
    chunks: int
    relations: int
    embeddings: int
    embed_dim: Optional[int] = None
    types: dict[str, int] = {}
    bytes_source: int = 0
```

SearchHit
A single result from `space.search(...)`. It exposes the matched chunk, its neighbor chunks (resolved into full `Chunk` objects for context windows), and a convenience `source` property.

| Field | Type | Default | Description |
|---|---|---|---|
| `chunk` | `Chunk` | required | The matched chunk. |
| `score` | `float` | required | Similarity score; higher is better. |
| `neighbors` | `list[Chunk]` | `[]` | Resolved neighbor chunks of `chunk`. |
| `source` (property) | `Source` | derived | Provenance of the matched chunk; shorthand for `hit.chunk.source`. |
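A context window can be assembled from a hit by ordering the matched chunk and its resolved neighbors by `index`. The sketch below uses trimmed-down dataclass stand-ins for `Chunk` and `SearchHit`; `context_text` is a hypothetical helper and the sample texts are made up.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Stand-in with only the fields this example needs."""
    id: str
    text: str
    index: int

@dataclass
class SearchHit:
    chunk: Chunk
    score: float
    neighbors: list[Chunk] = field(default_factory=list)

def context_text(hit: SearchHit) -> str:
    """Join the matched chunk and its neighbors in document order."""
    ordered = sorted([hit.chunk, *hit.neighbors], key=lambda c: c.index)
    return "\n".join(c.text for c in ordered)

hits = [
    SearchHit(Chunk("chunk_0010", "Unrelated.", 0), score=0.41),
    SearchHit(
        Chunk("chunk_0481", "Data is retained for 30 days.", 1),
        score=0.87,
        neighbors=[Chunk("chunk_0482", "Then it is deleted.", 2),
                   Chunk("chunk_0480", "Retention policy.", 0)],
    ),
]
best = max(hits, key=lambda h: h.score)  # higher score is better
window = context_text(best)
```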
```python
class SearchHit(BaseModel):
    chunk: Chunk
    score: float
    neighbors: list[Chunk] = []

    @property
    def source(self) -> Source:
        return self.chunk.source
```

KnowledgeSpace
The top-level result of processing a directory. It holds the document graph, chunks, relations, and metadata, and serializes to a single portable `.indx` archive. Beyond its data fields, `KnowledgeSpace` provides first-class accessors for stats, document filtering, semantic search, and load/save.

| Field | Type | Default | Description |
|---|---|---|---|
| `version` | `str` | `"1.0"` | Knowledge-space schema version. |
| `root` | `str` | required | Absolute path of the walked directory / ZIP. |
| `documents` | `list[Document]` | `[]` | The document graph. (Stored internally as `documents_` with a property shim; the public callable is `documents(type=...)` below.) |
| `chunks` | `list[Chunk]` | `[]` | All chunks in the space. |
| `relations` | `list[Relation]` | `[]` | Graph-level edges (optional mirror of per-object edges). |
| `metadata` | `dict[str, Any]` | `{}` | Tool version, config snapshot, build time, and any errors. |
Accessors
```python
space.stats                                       # -> SpaceStats
space.documents(type="policy")                    # -> list[Document], optionally filtered
space.search("how long is data retained?", k=5)   # -> list[SearchHit]

KnowledgeSpace.load(archive)                      # classmethod -> KnowledgeSpace
space.save(archive)                               # -> None
```

Full signatures, behavior, and examples for these methods live in the SDK reference.
SpaceContext
The shared, mutable carrier threaded through every pipeline stage. Each stage receives this object and returns the same object, mutated: `run(ctx: SpaceContext) -> SpaceContext`. Earlier stages populate collections that later stages read; see Pipeline and stages for the full flow.
SpaceContext sets model_config = {"arbitrary_types_allowed": True} because it holds bound component instances (the resolved protocols).
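The receive-mutate-return contract can be sketched with a tiny stand-in context. `Ctx`, `parse_stage`, and `chunk_stage` are hypothetical names for illustration, and the paragraph-splitting logic is a placeholder, not how indx chunks text.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Ctx:
    """Tiny stand-in for SpaceContext with two of the accumulated collections."""
    root: str
    parsed: dict[str, str] = field(default_factory=dict)
    chunks: list[str] = field(default_factory=list)

def parse_stage(ctx: Ctx) -> Ctx:
    # An earlier stage populates a collection...
    ctx.parsed["doc_0001"] = "first paragraph\n\nsecond paragraph"
    return ctx

def chunk_stage(ctx: Ctx) -> Ctx:
    # ...that a later stage reads and builds on.
    for text in ctx.parsed.values():
        ctx.chunks.extend(text.split("\n\n"))
    return ctx

stages: list[Callable[[Ctx], Ctx]] = [parse_stage, chunk_stage]
original = Ctx(root="docs/")
ctx = original
for stage in stages:
    ctx = stage(ctx)  # each stage hands back the same, mutated object
```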
Inputs
| Field | Type | Default | Description |
|---|---|---|---|
| `root` | `str` | required | Path being processed. |
| `out` | `str \| None` | `None` | Output directory for stage 06. |
| `config` | `IndxConfig` | required | Resolved `indx.toml` configuration. |
Bound components
Resolved before the run begins (see registry and defaults):

| Field | Type | Default | Description |
|---|---|---|---|
| `parser` | `Parser` | required | Bound parser. |
| `llm` | `LLM \| None` | `None` | Bound enrichment LLM, if any. |
| `vlm` | `VLM \| None` | `None` | Bound vision-language model, if any. |
| `embedder` | `Embedder` | required | Bound embedder. |
| `store` | `Store` | required | Bound vector store. |
| `writer` | `OutputWriter` | required | Bound output writer. |
Accumulated work
Populated stage by stage:

| Field | Type | Default | Description |
|---|---|---|---|
| `dir_graph` | `dict[str, Any]` | `{}` | 01 Walk: folder→children plus detected types. |
| `parsed` | `dict[str, ParsedDoc]` | `{}` | 02 Parse: doc_id → `ParsedDoc`. |
| `documents` | `list[Document]` | `[]` | Document graph (built across stages 01–05). |
| `chunks` | `list[Chunk]` | `[]` | 03 Chunk onward. |
| `relations` | `list[Relation]` | `[]` | 04 Relate. |
| `embeddings` | `dict[str, list[float]]` | `{}` | 06 Embed: chunk_id → vector. |
Diagnostics
| Field | Type | Default | Description |
|---|---|---|---|
| `errors` | `list[StageError]` | `[]` | Non-fatal per-item failures; surfaced on the space under `space.metadata["errors"]`. |
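Collecting non-fatal, per-item failures might look like the sketch below. The fields on `StageError` are a guess (its real shape is not documented on this page), and `parse_all` with its file-extension check is a made-up stand-in for a real stage.

```python
from dataclasses import dataclass

@dataclass
class StageError:
    """Hypothetical shape; the real StageError fields are not documented here."""
    stage: str
    item: str
    message: str

def parse_all(paths: list[str], errors: list[StageError]) -> dict[str, str]:
    """Parse what can be parsed; record failures instead of aborting the run."""
    parsed: dict[str, str] = {}
    for path in paths:
        try:
            if not path.endswith(".md"):
                raise ValueError("unsupported file type")
            parsed[path] = f"text of {path}"
        except ValueError as exc:
            errors.append(StageError(stage="parse", item=path, message=str(exc)))
    return parsed

errors: list[StageError] = []
parsed = parse_all(["a.md", "b.bin", "c.md"], errors)

# Mirrors the idea of surfacing errors under the space's metadata.
metadata = {"errors": [vars(e) for e in errors]}
```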
Materializing the result
```python
def to_space(self) -> KnowledgeSpace:
    """Materialize the accumulated context into a KnowledgeSpace."""
```

`ctx.to_space()` collapses the accumulated graph, chunks, relations, and metadata into the immutable `KnowledgeSpace` that the pipeline returns and the `.indx` archive seals.
See also
- Core objects: conceptual overview of how these models relate.
- `index.json` schema: the on-disk serialized form of the graph.
- Component protocols: the typed interfaces bound into `SpaceContext`.
- SDK reference: full method signatures for the `KnowledgeSpace` accessors.