Data Models

indx’s core domain types are Pydantic v2 models that travel through the pipeline, serialize into index.json, and seal into a portable .indx archive. This page documents every field of every model. For a gentler, concept-level tour of how these objects fit together, see Core objects.

Conventions

A few rules apply across all models:

Identifiers are zero-padded, stable strings: chunk ids look like chunk_0481 and document ids look like doc_0007. Ids are assigned in deterministic traversal order (folder lineage, then path, then in-document index), so re-running over unchanged input yields identical ids.
Vectors are list[float] — 32-bit floats in the on-disk matrix, surfaced as plain Python floats in memory.
Free-form metadata is typed dict[str, Any] and serialized with sorted keys for diff-friendly output.
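
The id and metadata conventions above can be illustrated with a minimal sketch. The helper names make_id and dump_metadata are hypothetical (they are not part of indx), and the four-digit padding width is only inferred from the example ids on this page.

```python
import json


def make_id(prefix: str, n: int, width: int = 4) -> str:
    """Format a zero-padded id such as chunk_0481.

    The default width of 4 is an assumption based on the example ids.
    """
    return f"{prefix}_{n:0{width}d}"


def dump_metadata(metadata: dict) -> str:
    """Serialize free-form metadata with sorted keys so diffs stay stable."""
    return json.dumps(metadata, sort_keys=True)


chunk_id = make_id("chunk", 481)
doc_id = make_id("doc", 7)

# Key order in the input does not affect the serialized output.
meta_json = dump_metadata({"topics": ["gdpr"], "author": "legal"})
```

Because the keys are sorted at serialization time, two runs that produce the same metadata in a different insertion order still write byte-identical JSON.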

RelationType

The kinds of typed edges that can connect documents and/or chunks in the graph. RelationType is a str enum, so its members serialize to their string values.

from enum import Enum

class RelationType(str, Enum):
    SIBLING       = "sibling"        # same folder / same logical group
    PARENT        = "parent"         # folder lineage / containment
    REFERENCES    = "references"     # outgoing citation, link, or mention
    CONTINUES     = "continues"      # next unit in a split sequence
    DUPLICATE_OF  = "duplicate-of"   # near/exact duplicate content

Value	Meaning
`sibling`	Same folder or same logical group.
`parent`	Folder lineage / containment.
`references`	Outgoing citation, link, or mention.
`continues`	Next unit in a split sequence.
`duplicate-of`	Near or exact duplicate content.
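
As a quick illustration of what "serialize to their string values" means in practice, the sketch below round-trips a member through the standard json module. It redefines the enum locally so the snippet is self-contained, and it uses only the standard library, not indx itself.

```python
import json
from enum import Enum


class RelationType(str, Enum):
    SIBLING = "sibling"
    PARENT = "parent"
    REFERENCES = "references"
    CONTINUES = "continues"
    DUPLICATE_OF = "duplicate-of"


edge_kind = RelationType.DUPLICATE_OF

# A str-based enum member is encoded as its plain string value.
encoded = json.dumps({"type": edge_kind})

# Looking the value up again returns the same enum member.
decoded = RelationType(json.loads(encoded)["type"])
```

Note that the serialized value uses a hyphen (duplicate-of) while the member name uses an underscore (DUPLICATE_OF), so lookups by value and by name are not interchangeable.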

Source

Provenance of a chunk or parsed unit — where the content came from in the walked tree.

Field	Type	Description
`path`	`str`	Original file path, relative to the walked root. Required.
`folder`	`str`	Containing folder, relative to root. Required.
`type`	`str`	Detected or enriched document type, e.g. `"policy"`. Required.

class Source(BaseModel):
    path:   str
    folder: str
    type:   str

Relation

A typed, directed edge in the knowledge graph. A Relation may connect chunk→document, document→document, or chunk→chunk. When the edge is stored on its owning object (the owner is the implicit source), from_id is omitted.

Field	Type	Default	Description
`type`	`RelationType`	— (required)	The kind of edge.
`to`	`str`	— (required)	Target id or path, e.g. `"legal/gdpr.md"` or `"chunk_0482"`.
`from_id`	`str \| None`	`None`	Source id; implicit (omitted) when stored on the source object.
`weight`	`float \| None`	`None`	Confidence / similarity score in `[0, 1]` where applicable.

from typing import Optional

from pydantic import BaseModel

class Relation(BaseModel):
    type:    RelationType
    to:      str
    from_id: Optional[str]   = None
    weight:  Optional[float] = None
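
The difference between an edge stored on its owner and a standalone graph-level edge can be sketched as follows. The models are redefined locally (with a trimmed enum) so the snippet runs on its own. Dropping unset fields with exclude_none=True is an assumption about how an omitted from_id might be rendered, not a statement about how indx writes index.json.

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel


class RelationType(str, Enum):
    REFERENCES = "references"
    CONTINUES = "continues"


class Relation(BaseModel):
    type: RelationType
    to: str
    from_id: Optional[str] = None
    weight: Optional[float] = None


# Edge stored on its owning object: the owner is the implicit source,
# so from_id is left unset.
owned = Relation(type="references", to="legal/gdpr.md")

# Standalone edge: the source has to be named explicitly.
standalone = Relation(
    type=RelationType.CONTINUES, to="chunk_0482", from_id="chunk_0481"
)

# Assumption: None-valued fields are dropped when dumping to JSON-ready dicts.
owned_json = owned.model_dump(mode="json", exclude_none=True)
standalone_json = standalone.model_dump(mode="json", exclude_none=True)
```

Pydantic coerces the plain string "references" into the matching RelationType member during validation, so callers can pass either form.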

ParsedDoc

The raw output of a Parser for a single source file, produced in stage 02 Parse before chunking. It carries normalized text plus structured artifacts that the 03 Chunk stage uses to split with structure intact. Returned by Parser.parse(file) -> ParsedDoc.

Field	Type	Default	Description
`source`	`Source`	— (required)	Provenance of this parsed unit.
`text`	`str`	— (required)	Normalized full-text rendering.
`blocks`	`list[dict[str, Any]]`	`[]`	Structural blocks: headings, paragraphs, list items.
`tables`	`list[dict[str, Any]]`	`[]`	Extracted tables (rows / cells / markdown).
`images`	`list[dict[str, Any]]`	`[]`	Image refs and captions for VLM enrichment.
`metadata`	`dict[str, Any]`	`{}`	Parser-supplied raw metadata (title, author, page count).

class ParsedDoc(BaseModel):
    source:   Source
    text:     str
    blocks:   list[dict[str, Any]] = []
    tables:   list[dict[str, Any]] = []
    images:   list[dict[str, Any]] = []
    metadata: dict[str, Any]       = {}

Chunk

A retrievable unit of content. A Chunk remembers its source document, its position within that document, the ids of its immediate neighbor chunks, and any typed relations. The embedding vector is populated in stage 06 Embed+Pack and is typically not inlined into index.json — it lives in embeddings/.

Field	Type	Default	Description
`id`	`str`	— (required)	Stable id, e.g. `"chunk_0481"`.
`text`	`str`	— (required)	The retrievable text payload.
`source`	`Source`	— (required)	Originating document provenance.
`doc_id`	`str`	— (required)	Id of the parent `Document`.
`index`	`int`	— (required)	0-based position within the parent document.
`metadata`	`dict[str, Any]`	`{}`	Enriched values: topics, summary, tags.
`neighbors`	`list[str]`	`[]`	Adjacent chunk ids (previous, next).
`relations`	`list[Relation]`	`[]`	Outgoing typed edges from this chunk.
`embedding`	`list[float] \| None`	`None`	Vector, populated in stage 06. May be omitted from `index.json`.

class Chunk(BaseModel):
    id:        str
    text:      str
    source:    Source
    doc_id:    str
    index:     int
    metadata:  dict[str, Any]        = {}
    neighbors: list[str]             = []
    relations: list[Relation]        = []
    embedding: Optional[list[float]] = None
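
To make the neighbors field concrete, here is a minimal sketch of how adjacent ids could be derived from a document's ordered chunk ids. The function link_neighbors is hypothetical. It assumes the first and last chunks simply have one neighbor each, which this page does not specify.

```python
def link_neighbors(chunk_ids: list[str]) -> dict[str, list[str]]:
    """Map each chunk id to its adjacent ids, in (previous, next) order.

    Assumption: boundary chunks list only the neighbor that exists.
    """
    neighbors: dict[str, list[str]] = {}
    for i, chunk_id in enumerate(chunk_ids):
        adjacent = []
        if i > 0:
            adjacent.append(chunk_ids[i - 1])  # previous chunk
        if i < len(chunk_ids) - 1:
            adjacent.append(chunk_ids[i + 1])  # next chunk
        neighbors[chunk_id] = adjacent
    return neighbors


links = link_neighbors(["chunk_0480", "chunk_0481", "chunk_0482"])
```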

Document

One source file, enriched. A Document holds folder lineage, detected type, and the LLM-derived topics / tags / summary from stage 05 Enrich, plus the resolved set of outgoing and incoming references from stage 04 Relate.

Field	Type	Default	Description
`id`	`str`	— (required)	Stable id, e.g. `"doc_0007"`.
`path`	`str`	— (required)	Original path relative to root.
`folder`	`str`	— (required)	Containing folder (lineage segment).
`lineage`	`list[str]`	`[]`	Folder ancestry, root→leaf.
`type`	`str`	— (required)	Detected / enriched document type.
`topics`	`list[str]`	`[]`	Enrichment-derived topics.
`tags`	`list[str]`	`[]`	Enrichment-derived tags.
`summary`	`str \| None`	`None`	Enrichment-derived summary.
`chunk_ids`	`list[str]`	`[]`	Chunks produced from this document, in order.
`references`	`list[Relation]`	`[]`	Outgoing references resolved in stage 04.
`referenced_by`	`list[Relation]`	`[]`	Incoming references (reverse edges).
`metadata`	`dict[str, Any]`	`{}`	Free-form additional metadata.

class Document(BaseModel):
    id:            str
    path:          str
    folder:        str
    lineage:       list[str]     = []
    type:          str
    topics:        list[str]     = []
    tags:          list[str]     = []
    summary:       Optional[str] = None
    chunk_ids:     list[str]     = []
    references:    list[Relation] = []
    referenced_by: list[Relation] = []
    metadata:      dict[str, Any] = {}

SpaceStats

Aggregate counts surfaced via space.stats. The same shape appears under the stats key of index.json and is what indx inspect --json emits.

Field	Type	Default	Description
`documents`	`int`	— (required)	Number of documents.
`chunks`	`int`	— (required)	Number of chunks.
`relations`	`int`	— (required)	Number of relations.
`embeddings`	`int`	— (required)	Number of stored vectors.
`embed_dim`	`int \| None`	`None`	Vector dimensionality, e.g. `1024` for `bge-m3`.
`types`	`dict[str, int]`	`{}`	Document count per detected type.
`bytes_source`	`int`	`0`	Total bytes of source material walked.

class SpaceStats(BaseModel):
    documents:    int
    chunks:       int
    relations:    int
    embeddings:   int
    embed_dim:    Optional[int]  = None
    types:        dict[str, int] = {}
    bytes_source: int            = 0
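
The counts above can be thought of as simple aggregates over the space's collections. The sketch below derives them from plain dicts and lists. The summarize helper, the sample type names, and the rule that embed_dim is None unless all vectors share one length are illustrative assumptions, not indx behavior. bytes_source is left out because it depends on the files walked.

```python
from collections import Counter


def summarize(documents, chunks, relations, embeddings) -> dict:
    """Compute SpaceStats-like counts from plain collections."""
    dims = {len(vector) for vector in embeddings.values()}
    return {
        "documents": len(documents),
        "chunks": len(chunks),
        "relations": len(relations),
        "embeddings": len(embeddings),
        # Assumption: report a dimension only when every vector agrees.
        "embed_dim": dims.pop() if len(dims) == 1 else None,
        "types": dict(Counter(doc["type"] for doc in documents)),
    }


stats = summarize(
    documents=[
        {"id": "doc_0001", "type": "policy"},
        {"id": "doc_0002", "type": "policy"},
        {"id": "doc_0003", "type": "contract"},
    ],
    chunks=["chunk_0001", "chunk_0002", "chunk_0003", "chunk_0004"],
    relations=[("chunk_0001", "continues", "chunk_0002")],
    embeddings={"chunk_0001": [0.1, 0.2, 0.3], "chunk_0002": [0.4, 0.5, 0.6]},
)
```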

SearchHit

A single result from space.search(...). It exposes the matched chunk, its neighbor chunks (resolved into full Chunk objects for context windows), and a convenience source property.

Field	Type	Default	Description
`chunk`	`Chunk`	— (required)	The matched chunk.
`score`	`float`	— (required)	Similarity score; higher is better.
`neighbors`	`list[Chunk]`	`[]`	Resolved neighbor chunks of `chunk`.
`source` (property)	`Source`	derived	Provenance of the matched chunk — shorthand for `hit.chunk.source`.

class SearchHit(BaseModel):
    chunk:     Chunk
    score:     float
    neighbors: list[Chunk] = []

    @property
    def source(self) -> Source:
        return self.chunk.source
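
One common use of the resolved neighbors is assembling a context window around the matched chunk. The sketch below is one way that might look. context_window is a hypothetical helper, and it uses SimpleNamespace stand-ins rather than real SearchHit objects, relying only on the chunk, neighbors, index, and text attributes documented above.

```python
from types import SimpleNamespace


def context_window(hit) -> str:
    """Join the matched chunk and its neighbors in document order."""
    ordered = sorted([hit.chunk, *hit.neighbors], key=lambda c: c.index)
    return "\n\n".join(c.text for c in ordered)


# Stand-in objects that mimic the attribute shape of SearchHit and Chunk.
hit = SimpleNamespace(
    chunk=SimpleNamespace(index=1, text="Data is retained for 90 days."),
    neighbors=[
        SimpleNamespace(index=2, text="Backups follow a separate schedule."),
        SimpleNamespace(index=0, text="Retention policy overview."),
    ],
)

window = context_window(hit)
```

Sorting by index keeps the window in reading order even if the neighbors arrive in a different order.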

KnowledgeSpace

The top-level result of processing a directory. It holds the document graph, chunks, relations, and metadata, and serializes to a single portable .indx archive. Beyond its data fields, KnowledgeSpace provides first-class accessors for stats, document filtering, semantic search, and load/save.

Field	Type	Default	Description
`version`	`str`	`"1.0"`	Knowledge-space schema version.
`root`	`str`	— (required)	Absolute path of the walked directory / ZIP.
`documents`	`list[Document]`	`[]`	The document graph. (Stored internally as `documents_` with a property shim; the public callable is `documents(type=...)` below.)
`chunks`	`list[Chunk]`	`[]`	All chunks in the space.
`relations`	`list[Relation]`	`[]`	Graph-level edges (optional mirror of per-object edges).
`metadata`	`dict[str, Any]`	`{}`	Tool version, config snapshot, build time, and any `errors`.

Accessors

space.stats                       # -> SpaceStats
space.documents(type="policy")    # -> list[Document], optionally filtered
space.search("how long is data retained?", k=5)  # -> list[SearchHit]

KnowledgeSpace.load(archive)      # classmethod -> KnowledgeSpace
space.save(archive)               # -> None

Full signatures, behavior, and examples for these methods live in the SDK reference.

SpaceContext

The shared, mutable carrier threaded through every pipeline stage. Each stage receives this object and returns the same object, mutated — run(ctx: SpaceContext) -> SpaceContext. Earlier stages populate collections that later stages read; see Pipeline and stages for the full flow.

SpaceContext sets model_config = {"arbitrary_types_allowed": True} because it holds bound component instances (the resolved protocols).
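
The two points above (stages mutate and return the same context, and the context holds non-Pydantic component instances) can be sketched with a scaled-down stand-in. MiniContext, UpperParser, and parse_stage are hypothetical names for illustration. The real SpaceContext has many more fields and binds protocol implementations rather than this toy class.

```python
from pydantic import BaseModel


class UpperParser:
    """Hypothetical plain-Python component (not a Pydantic model)."""

    def parse(self, text: str) -> str:
        return text.upper()


class MiniContext(BaseModel):
    # Needed so Pydantic accepts the plain UpperParser class as a field type.
    model_config = {"arbitrary_types_allowed": True}

    root: str
    parser: UpperParser
    parsed: dict[str, str] = {}


def parse_stage(ctx: MiniContext) -> MiniContext:
    """A stage mutates the shared context and hands the same object back."""
    ctx.parsed["doc_0001"] = ctx.parser.parse("hello")
    return ctx


ctx = MiniContext(root="/data", parser=UpperParser())
result = parse_stage(ctx)
```

Without arbitrary_types_allowed, Pydantic v2 rejects the parser field when the class is defined, because it cannot build a validation schema for an arbitrary class.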

Inputs

Field	Type	Default	Description
`root`	`str`	— (required)	Path being processed.
`out`	`str \| None`	`None`	Output directory for stage 06.
`config`	`IndxConfig`	— (required)	Resolved `indx.toml` configuration.

Bound components

Resolved before the run begins (see registry and defaults):

Field	Type	Default	Description
`parser`	`Parser`	— (required)	Bound parser.
`llm`	`LLM \| None`	`None`	Bound enrichment LLM, if any.
`vlm`	`VLM \| None`	`None`	Bound vision-language model, if any.
`embedder`	`Embedder`	— (required)	Bound embedder.
`store`	`Store`	— (required)	Bound vector store.
`writer`	`OutputWriter`	— (required)	Bound output writer.

Accumulated work

Populated stage by stage:

Field	Type	Default	Description
`dir_graph`	`dict[str, Any]`	`{}`	01 Walk — folder→children plus detected types.
`parsed`	`dict[str, ParsedDoc]`	`{}`	02 Parse — `doc_id` → `ParsedDoc`.
`documents`	`list[Document]`	`[]`	Document graph (built across stages 01–05).
`chunks`	`list[Chunk]`	`[]`	03 Chunk onward.
`relations`	`list[Relation]`	`[]`	04 Relate.
`embeddings`	`dict[str, list[float]]`	`{}`	06 Embed — `chunk_id` → vector.

Diagnostics

Field	Type	Default	Description
`errors`	`list[StageError]`	`[]`	Non-fatal per-item failures; surfaced on the space under `space.metadata["errors"]`.

Materializing the result

def to_space(self) -> KnowledgeSpace:
    """Materialize the accumulated context into a KnowledgeSpace."""

ctx.to_space() collapses the accumulated graph, chunks, relations, and metadata into the immutable KnowledgeSpace that the pipeline returns and the .indx archive seals.