Data Models

indx’s core domain types are Pydantic v2 models that travel through the pipeline, serialize into index.json, and seal into a portable .indx archive. This page documents every field of every model. For a gentler, concept-level tour of how these objects fit together, see Core objects.

A few rules apply across all models:

  • Identifiers are zero-padded, stable strings. Chunks are chunk_0481, documents are doc_0007. Ids are assigned in deterministic traversal order (folder lineage, then path, then in-document index), so re-running over unchanged input yields identical ids.
  • Vectors are list[float] — 32-bit floats in the on-disk matrix, surfaced as plain Python floats in memory.
  • Free-form metadata is typed dict[str, Any] and serialized with sorted keys for diff-friendly output.
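
As a rough illustration of the id and metadata rules above, the sketch below assigns document ids in a deterministic order and dumps metadata with sorted keys. The helpers `make_id` and `assign_doc_ids` are hypothetical, the four-digit width is inferred from the examples, and numbering from 0 is an assumption; none of this is indx's actual implementation.

```python
import json


def make_id(prefix: str, n: int, width: int = 4) -> str:
    """Format a zero-padded id such as 'doc_0007' (the width is an assumption)."""
    return f"{prefix}_{n:0{width}d}"


def assign_doc_ids(paths: list[str]) -> dict[str, str]:
    """Map each path to a doc id, ordered by folder lineage and then by path."""
    def sort_key(path: str) -> tuple[list[str], str]:
        folders = path.split("/")[:-1]  # folder lineage, root to leaf
        return (folders, path)

    ordered = sorted(paths, key=sort_key)
    # Numbering from 0 is an assumption made for this sketch.
    return {path: make_id("doc", i) for i, path in enumerate(ordered)}


paths = ["legal/gdpr.md", "hr/policy.md", "README.md"]
ids = assign_doc_ids(paths)
ids_again = assign_doc_ids(list(reversed(paths)))  # same input, different order

# Sorted keys keep serialized metadata stable from run to run.
metadata_json = json.dumps({"topics": ["privacy"], "author": "n/a"}, sort_keys=True)
```

Because the ordering depends only on the paths themselves, feeding the same files in a different order produces the same mapping.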

RelationType enumerates the typed graph edges between documents and/or chunks. It is a str enum, so its members serialize to their string values.

class RelationType(str, Enum):
    SIBLING = "sibling"            # same folder / same logical group
    PARENT = "parent"              # folder lineage / containment
    REFERENCES = "references"      # outgoing citation, link, or mention
    CONTINUES = "continues"        # next unit in a split sequence
    DUPLICATE_OF = "duplicate-of"  # near/exact duplicate content

| Value | Meaning |
| --- | --- |
| sibling | Same folder or same logical group. |
| parent | Folder lineage / containment. |
| references | Outgoing citation, link, or mention. |
| continues | Next unit in a split sequence. |
| duplicate-of | Near or exact duplicate content. |
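
To make the serialization behavior concrete, here is a small standard-library sketch. It re-declares the enum locally rather than importing it from indx, and uses `json` only to show how a str-mixin member round-trips.

```python
import json
from enum import Enum


class RelationType(str, Enum):
    SIBLING = "sibling"
    PARENT = "parent"
    REFERENCES = "references"
    CONTINUES = "continues"
    DUPLICATE_OF = "duplicate-of"


# Members are str instances, so the JSON encoder writes them as plain strings.
encoded = json.dumps({"type": RelationType.DUPLICATE_OF})

# Looking a member up by value is how a stored string parses back to the enum.
decoded = RelationType(json.loads(encoded)["type"])
```

Note that the member name (`DUPLICATE_OF`) and the serialized value (`duplicate-of`) differ, so consumers of index.json should match on the hyphenated value.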

Source records the provenance of a chunk or parsed unit: where the content came from in the walked tree.

| Field | Type | Description |
| --- | --- | --- |
| path | str | Original file path, relative to the walked root. Required. |
| folder | str | Containing folder, relative to root. Required. |
| type | str | Detected or enriched document type, e.g. "policy". Required. |

class Source(BaseModel):
    path: str
    folder: str
    type: str

A Relation is a typed, directed edge in the knowledge graph. It may connect chunk→document, document→document, or chunk→chunk. When the edge is stored on its owning object, that owner is the implicit source and from_id is omitted.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| type | RelationType | (required) | The kind of edge. |
| to | str | (required) | Target id or path, e.g. "legal/gdpr.md" or "chunk_0482". |
| from_id | Optional[str] | None | Source id; implicit (omitted) when stored on the source object. |
| weight | Optional[float] | None | Confidence / similarity score in [0, 1] where applicable. |

class Relation(BaseModel):
    type: RelationType
    to: str
    from_id: Optional[str] = None
    weight: Optional[float] = None
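
The implicit-source rule can be sketched with a plain dataclass stand-in. The helpers `effective_source` and `check_weight` are hypothetical and only illustrate how a consumer might resolve the source and sanity-check the weight; whether indx itself validates the range is not stated here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Relation:
    """Plain-dataclass stand-in for the Relation model (not the real class)."""
    type: str
    to: str
    from_id: Optional[str] = None
    weight: Optional[float] = None


def effective_source(relation: Relation, owner_id: str) -> str:
    """Return the explicit from_id, or fall back to the owning object's id."""
    return relation.from_id if relation.from_id is not None else owner_id


def check_weight(relation: Relation) -> None:
    """Reject weights outside [0, 1]; None means the score does not apply."""
    if relation.weight is not None and not 0.0 <= relation.weight <= 1.0:
        raise ValueError(f"weight out of range: {relation.weight}")


# Stored on chunk_0481, so the source is implied and from_id stays None.
on_chunk = Relation(type="continues", to="chunk_0482")
# Stored at graph level, so the source has to be spelled out.
graph_level = Relation(type="references", to="legal/gdpr.md", from_id="doc_0007", weight=0.9)
```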

A ParsedDoc is the raw output of a Parser for a single source file, produced in stage 02 Parse before chunking. It carries normalized text plus structured artifacts that the 03 Chunk stage uses to split with structure intact. It is returned by Parser.parse(file) -> ParsedDoc.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| source | Source | (required) | Provenance of this parsed unit. |
| text | str | (required) | Normalized full-text rendering. |
| blocks | list[dict[str, Any]] | [] | Structural blocks: headings, paragraphs, list items. |
| tables | list[dict[str, Any]] | [] | Extracted tables (rows / cells / markdown). |
| images | list[dict[str, Any]] | [] | Image refs and captions for VLM enrichment. |
| metadata | dict[str, Any] | {} | Parser-supplied raw metadata (title, author, page count). |

class ParsedDoc(BaseModel):
    source: Source
    text: str
    blocks: list[dict[str, Any]] = []
    tables: list[dict[str, Any]] = []
    images: list[dict[str, Any]] = []
    metadata: dict[str, Any] = {}
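
For a feel of what a parser hands to the chunking stage, the toy function below builds a ParsedDoc-shaped dict from Markdown-like text. Everything here is illustrative: `parse_plain_text` is a hypothetical helper, the block keys `kind` and `text` are invented for the sketch (the real block shape is parser-defined), and the `"unknown"` type is a placeholder.

```python
from typing import Any


def parse_plain_text(path: str, raw: str) -> dict[str, Any]:
    """Build a ParsedDoc-shaped dict from Markdown-like text (toy parser)."""
    blocks: list[dict[str, Any]] = []
    for part in raw.split("\n\n"):
        part = part.strip()
        if not part:
            continue
        if part.startswith("#"):
            # Hypothetical block shape; the real keys are parser-defined.
            blocks.append({"kind": "heading", "text": part.lstrip("#").strip()})
        else:
            # Collapse internal line breaks so the paragraph is one line.
            blocks.append({"kind": "paragraph", "text": " ".join(part.split())})
    folder = path.rsplit("/", 1)[0] if "/" in path else ""
    return {
        "source": {"path": path, "folder": folder, "type": "unknown"},
        "text": "\n\n".join(block["text"] for block in blocks),
        "blocks": blocks,
        "tables": [],
        "images": [],
        "metadata": {},
    }


doc = parse_plain_text("legal/gdpr.md", "# Retention\n\nData is kept\nfor 30 days.\n")
```

Keeping headings and paragraphs as separate blocks is what lets a later stage split on structural boundaries instead of raw character offsets.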

A Chunk is a retrievable unit of content. It remembers its source document, its position within that document, the ids of its immediate neighbor chunks, and any typed relations. The embedding vector is populated in stage 06 Embed+Pack and is typically not inlined into index.json; it lives in embeddings/.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| id | str | (required) | Stable id, e.g. "chunk_0481". |
| text | str | (required) | The retrievable text payload. |
| source | Source | (required) | Originating document provenance. |
| doc_id | str | (required) | Id of the parent Document. |
| index | int | (required) | 0-based position within the parent document. |
| metadata | dict[str, Any] | {} | Enriched values: topics, summary, tags. |
| neighbors | list[str] | [] | Adjacent chunk ids (previous, next). |
| relations | list[Relation] | [] | Outgoing typed edges from this chunk. |
| embedding | Optional[list[float]] | None | Vector, populated in stage 06. May be omitted from index.json. |

class Chunk(BaseModel):
    id: str
    text: str
    source: Source
    doc_id: str
    index: int
    metadata: dict[str, Any] = {}
    neighbors: list[str] = []
    relations: list[Relation] = []
    embedding: Optional[list[float]] = None
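
One plausible way to derive the neighbors field from a document's ordered chunk ids is sketched below. `link_neighbors` is a hypothetical helper, and the sketch assumes that the first and last chunks of a document simply have one neighbor each.

```python
def link_neighbors(chunk_ids: list[str]) -> dict[str, list[str]]:
    """Map each chunk id to its adjacent ids (previous first, then next)."""
    neighbors: dict[str, list[str]] = {}
    for i, chunk_id in enumerate(chunk_ids):
        adjacent = []
        if i > 0:
            adjacent.append(chunk_ids[i - 1])  # previous chunk in the document
        if i < len(chunk_ids) - 1:
            adjacent.append(chunk_ids[i + 1])  # next chunk in the document
        neighbors[chunk_id] = adjacent
    return neighbors


links = link_neighbors(["chunk_0480", "chunk_0481", "chunk_0482"])
```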

A Document is one source file, enriched. It holds folder lineage, detected type, and the LLM-derived topics / tags / summary from stage 05 Enrich, plus the resolved set of outgoing and incoming references from stage 04 Relate.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| id | str | (required) | Stable id, e.g. "doc_0007". |
| path | str | (required) | Original path relative to root. |
| folder | str | (required) | Containing folder (lineage segment). |
| lineage | list[str] | [] | Folder ancestry, root→leaf. |
| type | str | (required) | Detected / enriched document type. |
| topics | list[str] | [] | Enrichment-derived topics. |
| tags | list[str] | [] | Enrichment-derived tags. |
| summary | Optional[str] | None | Enrichment-derived summary. |
| chunk_ids | list[str] | [] | Chunks produced from this document, in order. |
| references | list[Relation] | [] | Outgoing references resolved in stage 04. |
| referenced_by | list[Relation] | [] | Incoming references (reverse edges). |
| metadata | dict[str, Any] | {} | Free-form additional metadata. |

class Document(BaseModel):
    id: str
    path: str
    folder: str
    lineage: list[str] = []
    type: str
    topics: list[str] = []
    tags: list[str] = []
    summary: Optional[str] = None
    chunk_ids: list[str] = []
    references: list[Relation] = []
    referenced_by: list[Relation] = []
    metadata: dict[str, Any] = {}
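
Since referenced_by holds the reverse of references, it can in principle be derived by inverting the outgoing edges. The sketch below shows that inversion on bare ids; `build_referenced_by` is a hypothetical helper, and the real stage 04 works with Relation objects rather than plain strings.

```python
from collections import defaultdict


def build_referenced_by(references: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert outgoing references (doc id -> targets) into incoming ones."""
    incoming: dict[str, list[str]] = defaultdict(list)
    for source_id in sorted(references):  # sorted for deterministic output
        for target_id in references[source_id]:
            incoming[target_id].append(source_id)
    return dict(incoming)


outgoing = {
    "doc_0007": ["doc_0002"],
    "doc_0003": ["doc_0002", "doc_0007"],
}
incoming = build_referenced_by(outgoing)
```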

SpaceStats holds the aggregate counts surfaced via space.stats. The same shape appears under the stats key of index.json and is what indx inspect --json emits.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| documents | int | (required) | Number of documents. |
| chunks | int | (required) | Number of chunks. |
| relations | int | (required) | Number of relations. |
| embeddings | int | (required) | Number of stored vectors. |
| embed_dim | Optional[int] | None | Vector dimensionality, e.g. 1024 for bge-m3. |
| types | dict[str, int] | {} | Document count per detected type. |
| bytes_source | int | 0 | Total bytes of source material walked. |

class SpaceStats(BaseModel):
    documents: int
    chunks: int
    relations: int
    embeddings: int
    embed_dim: Optional[int] = None
    types: dict[str, int] = {}
    bytes_source: int = 0
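
The counts are simple aggregates, as the following sketch suggests. `compute_stats` is a hypothetical helper that returns a SpaceStats-shaped dict; it leaves out bytes_source and assumes all stored vectors share one dimensionality.

```python
from collections import Counter
from typing import Any, Optional


def compute_stats(
    doc_types: list[str],
    chunk_count: int,
    relation_count: int,
    vectors: dict[str, list[float]],
) -> dict[str, Any]:
    """Aggregate counts into a SpaceStats-shaped dict (bytes_source omitted)."""
    # embed_dim stays None when nothing has been embedded yet.
    embed_dim: Optional[int] = len(next(iter(vectors.values()))) if vectors else None
    return {
        "documents": len(doc_types),
        "chunks": chunk_count,
        "relations": relation_count,
        "embeddings": len(vectors),
        "embed_dim": embed_dim,
        "types": dict(Counter(doc_types)),  # document count per detected type
    }


stats = compute_stats(
    doc_types=["policy", "policy", "contract"],
    chunk_count=5,
    relation_count=4,
    vectors={"chunk_0001": [0.1, 0.2, 0.3], "chunk_0002": [0.0, 0.5, 0.5]},
)
```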

A SearchHit is a single result from space.search(...). It exposes the matched chunk, its neighbor chunks (resolved into full Chunk objects for context windows), and a convenience source property.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| chunk | Chunk | (required) | The matched chunk. |
| score | float | (required) | Similarity score; higher is better. |
| neighbors | list[Chunk] | [] | Resolved neighbor chunks of chunk. |
| source (property) | Source | derived | Provenance of the matched chunk; shorthand for hit.chunk.source. |

class SearchHit(BaseModel):
    chunk: Chunk
    score: float
    neighbors: list[Chunk] = []

    @property
    def source(self) -> Source:
        return self.chunk.source
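
The resolved neighbors exist so that a caller can assemble a context window around a match. A minimal sketch of that assembly follows, using trimmed dataclass stand-ins rather than the real models; `context_window` is a hypothetical helper and assumes the neighbors come from the same document as the matched chunk.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Trimmed stand-in carrying only the fields this sketch needs."""
    id: str
    text: str
    index: int


@dataclass
class SearchHit:
    """Trimmed stand-in for the SearchHit model."""
    chunk: Chunk
    score: float
    neighbors: list[Chunk] = field(default_factory=list)


def context_window(hit: SearchHit) -> str:
    """Join the matched chunk and its neighbors in document order."""
    ordered = sorted([hit.chunk, *hit.neighbors], key=lambda c: c.index)
    return "\n".join(c.text for c in ordered)


hit = SearchHit(
    chunk=Chunk("chunk_0481", "Records are retained for a fixed period.", 1),
    score=0.87,
    neighbors=[
        Chunk("chunk_0482", "After that they are deleted.", 2),
        Chunk("chunk_0480", "Retention policy.", 0),
    ],
)
window = context_window(hit)
```

Sorting by index rather than trusting list order keeps the window readable even if neighbors arrive out of sequence.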

A KnowledgeSpace is the top-level result of processing a directory. It holds the document graph, chunks, relations, and metadata, and serializes to a single portable .indx archive. Beyond its data fields, KnowledgeSpace provides first-class accessors for stats, document filtering, semantic search, and load/save.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| version | str | "1.0" | Knowledge-space schema version. |
| root | str | (required) | Absolute path of the walked directory / ZIP. |
| documents | list[Document] | [] | The document graph. (Stored internally as documents_ with a property shim; the public callable is documents(type=...) below.) |
| chunks | list[Chunk] | [] | All chunks in the space. |
| relations | list[Relation] | [] | Graph-level edges (optional mirror of per-object edges). |
| metadata | dict[str, Any] | {} | Tool version, config snapshot, build time, and any errors. |

space.stats                                      # -> SpaceStats
space.documents(type="policy")                   # -> list[Document], optionally filtered
space.search("how long is data retained?", k=5)  # -> list[SearchHit]
KnowledgeSpace.load(archive)                     # classmethod -> KnowledgeSpace
space.save(archive)                              # -> None

Full signatures, behavior, and examples for these methods live in the SDK reference.
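
To give a sense of what a top-k semantic search does with stored vectors, here is a dependency-free sketch. It is not how space.search is implemented: the similarity metric is not specified on this page, so cosine similarity is an assumption, and `cosine` and `top_k` are hypothetical helpers operating on bare vectors instead of an embedded query string.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: list[float], vectors: dict[str, list[float]], k: int = 5
) -> list[tuple[str, float]]:
    """Rank chunk ids by similarity to the query vector, best first."""
    scored = [(chunk_id, cosine(query, vec)) for chunk_id, vec in vectors.items()]
    # Descending score, with ties broken by id so results are deterministic.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


vectors = {
    "chunk_0001": [1.0, 0.0],
    "chunk_0002": [0.0, 1.0],
    "chunk_0003": [1.0, 1.0],
}
hits = top_k([1.0, 0.0], vectors, k=2)
```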

SpaceContext is the shared, mutable carrier threaded through every pipeline stage. Each stage receives this object and returns the same object, mutated: run(ctx: SpaceContext) -> SpaceContext. Earlier stages populate collections that later stages read; see Pipeline and stages for the full flow.

SpaceContext sets model_config = {"arbitrary_types_allowed": True} because it holds bound component instances (the resolved protocols).

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| root | str | (required) | Path being processed. |
| out | Optional[str] | None | Output directory for stage 06. |
| config | IndxConfig | (required) | Resolved indx.toml configuration. |

Resolved before the run begins (see registry and defaults):

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| parser | Parser | (required) | Bound parser. |
| llm | Optional[LLM] | None | Bound enrichment LLM, if any. |
| vlm | Optional[VLM] | None | Bound vision-language model, if any. |
| embedder | Embedder | (required) | Bound embedder. |
| store | Store | (required) | Bound vector store. |
| writer | OutputWriter | (required) | Bound output writer. |

Populated stage by stage:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| dir_graph | dict[str, Any] | {} | 01 Walk: folder→children plus detected types. |
| parsed | dict[str, ParsedDoc] | {} | 02 Parse: doc_id → ParsedDoc. |
| documents | list[Document] | [] | Document graph (built across stages 01–05). |
| chunks | list[Chunk] | [] | 03 Chunk onward. |
| relations | list[Relation] | [] | 04 Relate. |
| embeddings | dict[str, list[float]] | {} | 06 Embed: chunk_id → vector. |

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| errors | list[StageError] | [] | Non-fatal per-item failures; surfaced on the space under space.metadata["errors"]. |

def to_space(self) -> KnowledgeSpace:
    """Materialize the accumulated context into a KnowledgeSpace."""

ctx.to_space() collapses the accumulated graph, chunks, relations, and metadata into the immutable KnowledgeSpace that the pipeline returns and the .indx archive seals.
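
The stage contract described in this section (receive the context, mutate it, return the same object, record per-item failures without aborting) can be sketched as follows. The trimmed `SpaceContext` dataclass, the two toy stages, `run_pipeline`, and the dict shape used for errors are all hypothetical stand-ins; the real StageError type and stage implementations are not shown on this page.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class SpaceContext:
    """Trimmed stand-in for the real context (illustration only)."""
    root: str
    parsed: dict[str, str] = field(default_factory=dict)
    chunks: list[str] = field(default_factory=list)
    errors: list[dict[str, Any]] = field(default_factory=list)


def parse_stage(ctx: SpaceContext) -> SpaceContext:
    """Populate ctx.parsed; a bad item is recorded instead of aborting the run."""
    raw = {"doc_0000": "alpha beta", "doc_0001": None}  # None simulates a parse failure
    for doc_id, text in raw.items():
        try:
            ctx.parsed[doc_id] = text.strip()
        except AttributeError as exc:
            # Hypothetical error shape; the real StageError fields may differ.
            ctx.errors.append({"stage": "parse", "item": doc_id, "error": str(exc)})
    return ctx


def chunk_stage(ctx: SpaceContext) -> SpaceContext:
    """Read what the parse stage wrote and append one toy chunk per word."""
    for doc_id in sorted(ctx.parsed):
        ctx.chunks.extend(ctx.parsed[doc_id].split())
    return ctx


def run_pipeline(
    ctx: SpaceContext, stages: list[Callable[[SpaceContext], SpaceContext]]
) -> SpaceContext:
    """Thread the same context object through each stage in order."""
    for stage in stages:
        ctx = stage(ctx)
    return ctx


ctx = SpaceContext(root="/data/corpus")
result = run_pipeline(ctx, [parse_stage, chunk_stage])
```

Returning the same mutated object keeps the stage signature uniform, so stages can be reordered or swapped without changing the runner.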