Core Objects
indx has a small, deliberate vocabulary of data models. Learn these six objects and you understand the whole tool: a pipeline reads a directory and assembles them into a single, portable result you can query. This page is the conceptual tour — for every field, constraint, and serialized shape, see the data-models reference.
All core objects are Pydantic v2 models. Identifiers are stable, zero-padded strings (doc_0007, chunk_0481) assigned in a deterministic traversal order, so re-running over unchanged input yields identical ids.
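As a rough illustration of why deterministic traversal gives stable ids, the sketch below numbers a set of relative paths in sorted order. The helper assign_doc_ids, the sort key, and the choice to start counting at 1 are assumptions made for this example; they are not taken from indx itself.

```python
from pathlib import PurePosixPath


def assign_doc_ids(paths: list[str], width: int = 4) -> dict[str, str]:
    """Map each relative path to a zero-padded id (hypothetical helper).

    Sorting by path components makes the numbering independent of the
    order in which the filesystem happens to list the files.
    """
    ordered = sorted(paths, key=lambda p: PurePosixPath(p).parts)
    return {path: f"doc_{i:0{width}d}" for i, path in enumerate(ordered, start=1)}


paths = ["policies/data/retention.md", "guides/onboarding.md", "policies/security.md"]
first_run = assign_doc_ids(paths)
second_run = assign_doc_ids(list(reversed(paths)))  # same files, different listing order

print(first_run["guides/onboarding.md"])
print(first_run == second_run)
```

Because the ids depend only on the sorted set of paths, shuffling the input order does not change them; adding or removing a file would shift later ids under this simple scheme.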
The big picture
A run produces one KnowledgeSpace. Inside it lives a graph of Documents (one per source file) and a flat list of Chunks (the retrievable units). Relations are the typed edges that wire documents and chunks together. Every chunk and document carries a Source describing where it came from. While the pipeline is running, a single mutable SpaceContext carries the half-built work from one stage to the next, and is finally materialized into the KnowledgeSpace.
    KnowledgeSpace                        (top-level result → .indx archive)
    ├── documents: Document[]             one enriched source file each
    │   ├── Source                        provenance (path, folder, type)
    │   ├── chunk_ids ────────┐           links into the chunk list
    │   └── references / referenced_by : Relation[]
    ├── chunks: Chunk[] ◀─────┘           retrievable units
    │   ├── Source
    │   ├── neighbors                     adjacent chunk ids
    │   ├── relations: Relation[]
    │   └── embedding                     vector (stage 06), in-memory; serialized to
    │                                     embeddings/vectors.f32, never inlined in index.json
    └── relations: Relation[]             graph-level edges

KnowledgeSpace
The top-level result of processing a directory. It holds the document graph, the chunks, the embeddings, and run metadata, and it serializes to a single portable .indx archive that re-loads without re-processing.
Most important attributes:
- root: absolute path of the walked directory or ZIP.
- documents: the Document graph (exposed as a callable accessor; see below).
- chunks: every Chunk in the space.
- relations: graph-level edges (a mirror of the per-object edges).
- metadata: tool version, resolved config snapshot, build time, and any non-fatal errors.
Beyond the data, KnowledgeSpace gives you a first-class API so you rarely touch the raw fields:
    from indx import KnowledgeSpace

    # The archive is named <name>.indx; the default name is handbook (set with --name).
    space = KnowledgeSpace.load("./ai-ready/handbook.indx")

    space.stats                            # SpaceStats: counts, embed_dim, type histogram
    space.documents(type="policy")         # filter the document graph by detected type
    space.search("data retention", k=5)    # semantic search → list[SearchHit]
    space.save("./copy.indx")              # seal back to a portable archive

- stats returns a SpaceStats with document/chunk/relation/embedding counts, embed_dim, and a per-type histogram.
- documents(type=...) returns the documents, optionally filtered by detected type.
- search(query, k=5) embeds the query and returns the top-k chunks as SearchHits, each with its resolved neighbor chunks and source.
- load(path) / save(path) open and seal .indx archives.
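To make the "embed the query, return the top-k chunks" step concrete, here is a minimal pure-Python sketch of ranking chunk vectors against a query vector. It assumes cosine similarity as the score, which this page does not state; the functions cosine and top_k and the toy two-dimensional vectors are hypothetical and do not reflect how indx implements search.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], vectors: dict[str, list[float]], k: int = 5):
    """Return the k (chunk_id, score) pairs with the highest similarity."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings keyed by chunk id (assumed data, not real model output).
vectors = {
    "chunk_0001": [1.0, 0.0],
    "chunk_0002": [0.7, 0.7],
    "chunk_0003": [0.0, 1.0],
}
hits = top_k([1.0, 0.1], vectors, k=2)
print([chunk_id for chunk_id, _ in hits])
```

In a real run the query vector would come from the same embedder that produced the chunk vectors, so that the two live in the same space.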
Document
One source file, enriched. A Document is where indx records everything a file-level parser throws away (folder lineage, document type, and cross-document references) alongside LLM-derived semantics.
Most important attributes:
- id: stable id, e.g. doc_0007.
- path / folder: original location relative to the walked root.
- lineage: the folder ancestry root→leaf (e.g. ["policies", "policies/data"]), so agents can filter and reason by location.
- type: detected/enriched document type (e.g. policy, guide), which drives type-aware enrichment.
- topics, tags, summary: semantic metadata added during Enrich.
- chunk_ids: the chunks produced from this document, in order.
- references / referenced_by: outgoing and incoming Relation edges resolved during Relate.
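The lineage field in the list above can be pictured as the running prefixes of the folder path. The sketch below derives that shape from a relative folder string; folder_lineage is a hypothetical helper written for this page, not an indx function.

```python
from pathlib import PurePosixPath


def folder_lineage(folder: str) -> list[str]:
    """Return the folder ancestry from root to leaf as cumulative relative paths."""
    parts = PurePosixPath(folder).parts
    return ["/".join(parts[: i + 1]) for i in range(len(parts))]


print(folder_lineage("policies/data"))  # ['policies', 'policies/data']
print(folder_lineage(""))               # [] for a file directly under the root
```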
A Document does not embed its chunks; it links to them by id through chunk_ids. Resolve those against space.chunks to walk from a document to its retrievable content.
Chunk
The retrievable unit: the thing your RAG system or agent actually fetches. Every chunk remembers exactly where it sits, so retrieval never returns an orphaned fragment.
Most important attributes:
- id: stable id, e.g. chunk_0481.
- text: the retrievable text payload.
- source: the originating Source (path, folder, type).
- doc_id: id of the parent Document.
- index: 0-based position within that document.
- metadata: enriched topics, summary, and tags.
- neighbors: adjacent chunk ids (previous and next), for expanding context windows.
- relations: outgoing typed edges from this chunk.
- embedding: the vector, populated in Embed+Pack. It is an in-memory field that is never inlined into the chunk JSON. Vectors are serialized separately to the archive's embeddings/vectors.f32 (see the .indx archive reference), so embedding is omitted from index.json and is None until embedding runs.
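To show why keeping vectors out of the chunk JSON is practical, the sketch below packs a list of vectors into one flat float32 blob and reads it back given the embedding dimension. The layout used here (row-major, little-endian, no header) is purely an assumption for illustration; the actual format of embeddings/vectors.f32 is defined in the .indx archive reference, and pack_vectors / unpack_vectors are hypothetical names.

```python
import struct


def pack_vectors(vectors: list[list[float]]) -> bytes:
    """Flatten vectors row by row into little-endian float32 bytes (assumed layout)."""
    flat = [x for vec in vectors for x in vec]
    return struct.pack(f"<{len(flat)}f", *flat)


def unpack_vectors(blob: bytes, dim: int) -> list[list[float]]:
    """Split a float32 blob back into vectors of length dim."""
    count = len(blob) // 4  # 4 bytes per float32
    flat = struct.unpack(f"<{count}f", blob)
    return [list(flat[i : i + dim]) for i in range(0, count, dim)]


# Values chosen to be exactly representable in float32, so the round trip is lossless.
vectors = [[0.5, -1.0, 2.0], [0.25, 0.0, 1.5]]
blob = pack_vectors(vectors)

print(len(blob))                               # 24 bytes: 6 values x 4 bytes
print(unpack_vectors(blob, dim=3) == vectors)  # True
```

A flat binary file like this is several times smaller than the same numbers written as JSON text and can be read without parsing, which is one plausible reason to store embeddings separately.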
Because a chunk carries doc_id, index, neighbors, and relations, an agent can climb back to the parent document, pull adjacent chunks for context, or follow a continues/references edge — instead of reasoning over a flat bag of text.
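A minimal sketch of the "pull adjacent chunks for context" step, using a stand-in dataclass rather than the real Chunk model. ChunkStub and expand_context are hypothetical names; only the id, index, text, and neighbors fields described above are modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkStub:
    """Hypothetical stand-in for Chunk with only the fields used here."""
    id: str
    index: int
    text: str
    neighbors: list[str] = field(default_factory=list)


def expand_context(hit: ChunkStub, by_id: dict[str, ChunkStub]) -> str:
    """Join the hit with its adjacent chunks, ordered by position in the document."""
    window = [hit] + [by_id[n] for n in hit.neighbors if n in by_id]
    window.sort(key=lambda c: c.index)
    return "\n".join(c.text for c in window)


chunks = [
    ChunkStub("chunk_0001", 0, "Retention applies to all records.", ["chunk_0002"]),
    ChunkStub("chunk_0002", 1, "Records are kept for seven years.", ["chunk_0001", "chunk_0003"]),
    ChunkStub("chunk_0003", 2, "Exceptions require legal approval.", ["chunk_0002"]),
]
by_id = {c.id: c for c in chunks}

print(expand_context(by_id["chunk_0002"], by_id))
```

Sorting by index rather than trusting the order of neighbors keeps the window in reading order even for the first or last chunk of a document, which has only one neighbor.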
Relation
A typed, directed edge in the knowledge graph. A Relation may connect document→document, chunk→chunk, or chunk→document. This is how indx captures the structure that file-by-file parsing destroys.
Attributes:
- type: one of the five RelationType values.
- to: the target id or path (e.g. legal/gdpr.md or chunk_0482).
- from_id: optional source id; omitted when the edge is stored on its owning object (the owner is the implicit source).
- weight: optional confidence/similarity score in [0, 1], where applicable.
The five relation types:
| RelationType | Meaning |
|---|---|
| sibling | Same folder / same logical group. |
| parent | Folder lineage / containment. |
| references | An outgoing citation, link, or mention. |
| continues | The next unit in a split sequence. |
| duplicate-of | Near or exact duplicate content. |
Relations appear in three places: on a chunk’s relations, on a document’s references / referenced_by, and (as an optional mirror) on the space-level relations list. See Relate for how each type is derived.
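The shape of a Relation can be sketched with the standard library alone. The real models are Pydantic v2 classes; the dataclass RelationStub, the enum member names, and the edges_of_type helper below are illustrative assumptions that mirror only the attributes and the five type values listed above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RelationType(str, Enum):
    """The five relation type values from the table above."""
    SIBLING = "sibling"
    PARENT = "parent"
    REFERENCES = "references"
    CONTINUES = "continues"
    DUPLICATE_OF = "duplicate-of"


@dataclass(frozen=True)
class RelationStub:
    """Hypothetical stand-in for Relation; not the library's own class."""
    type: RelationType
    to: str
    from_id: Optional[str] = None   # omitted when the owner is the implicit source
    weight: Optional[float] = None  # confidence/similarity in [0, 1], where applicable

    def __post_init__(self) -> None:
        if self.weight is not None and not 0.0 <= self.weight <= 1.0:
            raise ValueError("weight must be in [0, 1]")


def edges_of_type(relations: list[RelationStub], wanted: RelationType) -> list[RelationStub]:
    """Keep only the edges of one relation type."""
    return [r for r in relations if r.type is wanted]


relations = [
    RelationStub(RelationType.REFERENCES, to="legal/gdpr.md"),
    RelationStub(RelationType.CONTINUES, to="chunk_0482"),
    RelationStub(RelationType("duplicate-of"), to="chunk_0100", weight=0.97),
]
print([r.to for r in edges_of_type(relations, RelationType.REFERENCES)])
```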
Source
The lightweight provenance record attached to every chunk and parsed unit. It is what makes results traceable back to disk.
Attributes:
- path: original file path relative to the walked root.
- folder: containing folder, relative to root.
- type: detected/enriched document type (e.g. policy).
Source is small by design: it travels with chunks and ParsedDocs so that a SearchHit can expose hit.source.path without re-resolving the parent document.
SpaceContext
The shared, mutable carrier threaded through every pipeline stage. Where the objects above are the result, SpaceContext is the work in progress. Each stage obeys the contract run(ctx: SpaceContext) -> SpaceContext and returns the same object it received, mutated in place: it appends to the collections relevant to its phase while reading what earlier stages produced.
What it carries:
- Inputs: root, out, and the resolved config.
- Bound components: parser, llm, vlm, embedder, store, writer (the resolved protocol implementations).
- Accumulated work: dir_graph (Walk), parsed (Parse), documents, chunks (Chunk onward), relations (Relate), and embeddings (Embed).
- Diagnostics: errors, a list of non-fatal per-item failures surfaced later under space.metadata["errors"].
When the pipeline finishes, ctx.to_space() materializes the context into the final KnowledgeSpace.
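The stage contract described in this section can be sketched with a toy context and two toy stages. Everything here is a stand-in: ContextStub carries only three of the collections named above, and ParseStage, ChunkStage, and run_pipeline are hypothetical names that show the mutate-in-place pattern, not the real stage implementations.

```python
from dataclasses import dataclass, field


@dataclass
class ContextStub:
    """Hypothetical, heavily reduced stand-in for SpaceContext."""
    root: str
    parsed: list[str] = field(default_factory=list)
    chunks: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)


class ParseStage:
    def run(self, ctx: ContextStub) -> ContextStub:
        # Pretend these files were found under ctx.root by an earlier Walk stage.
        for name in ("a.md", "broken.bin", "b.md"):
            if name.endswith(".md"):
                ctx.parsed.append(name)
            else:
                ctx.errors.append(f"parse failed: {name}")  # non-fatal, keep going
        return ctx  # same object, mutated in place


class ChunkStage:
    def run(self, ctx: ContextStub) -> ContextStub:
        # Reads what the previous stage produced and appends its own output.
        for name in ctx.parsed:
            ctx.chunks.append(f"{name}#0")
        return ctx


def run_pipeline(ctx: ContextStub, stages: list) -> ContextStub:
    """Thread one context through every stage in order."""
    for stage in stages:
        ctx = stage.run(ctx)
    return ctx


ctx = ContextStub(root="./handbook")
result = run_pipeline(ctx, [ParseStage(), ChunkStage()])
print(result is ctx, result.chunks, result.errors)
```

Because each stage returns the object it was given, a per-item failure recorded in errors by one stage is still visible after the last stage, without aborting the run.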
How they fit together: a quick walk
Conceptually, the models nest like this:
    from indx import KnowledgeSpace, Document, Chunk, Relation, Source

    space = KnowledgeSpace.load("./ai-ready/handbook.indx")

    doc: Document = space.documents(type="policy")[0]
    print(doc.path, doc.lineage, doc.topics)    # provenance + semantics

    # Walk from a document to its chunks via chunk_ids, preserving document order.
    by_id = {c.id: c for c in space.chunks}
    chunks = [by_id[cid] for cid in doc.chunk_ids]
    first: Chunk = chunks[0]
    print(first.text, first.neighbors)          # text + adjacent ids

    # Every chunk carries a Source and typed Relations.
    src: Source = first.source
    for rel in first.relations:
        print(rel)                              # e.g. Relation(type="references", to="legal/gdpr.md")

From here, deepen your understanding with the pipeline and stages concept, the bring-your-own-stack model, or the complete data-models reference.