Core Objects

indx has a small, deliberate vocabulary of data models. Learn these six objects and you understand the whole tool: a pipeline reads a directory and assembles them into a single, portable result you can query. This page is the conceptual tour — for every field, constraint, and serialized shape, see the data-models reference.

All core objects are Pydantic v2 models. Identifiers are stable, zero-padded strings (doc_0007, chunk_0481) assigned in a deterministic traversal order, so re-running over unchanged input yields identical ids.
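As a rough illustration of why ids stay stable, here is a minimal sketch that derives zero-padded ids from a sorted traversal. The helper name assign_doc_ids, the 0-based numbering, and sorting by path parts are all assumptions made for this example; the four-digit width only mirrors the doc_0007 example, and indx's real traversal order may differ.

```python
from pathlib import PurePosixPath


def assign_doc_ids(paths: list[str], width: int = 4) -> dict[str, str]:
    """Map each path to a zero-padded id based on its sorted position.

    Hypothetical helper: sorting by path parts stands in for whatever
    deterministic traversal order the real pipeline uses.
    """
    ordered = sorted(paths, key=lambda p: PurePosixPath(p).parts)
    return {path: f"doc_{i:0{width}d}" for i, path in enumerate(ordered)}


# The same set of files yields the same ids regardless of input order.
first_run = assign_doc_ids(["policies/data/retention.md", "guides/setup.md"])
second_run = assign_doc_ids(["guides/setup.md", "policies/data/retention.md"])
```

Because the ids depend only on the sorted set of paths, listing the files in a different order does not change the assignment.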

A run produces one KnowledgeSpace. Inside it live a graph of Documents (one per source file) and a flat list of Chunks (the retrievable units). Relations are the typed edges that wire documents and chunks together. Every chunk and document carries a Source describing where it came from. While the pipeline is running, a single mutable SpaceContext carries the half-built work from one stage to the next; when the run ends, it is materialized into the KnowledgeSpace.

KnowledgeSpace                      (top-level result → .indx archive)
├── documents: Document[]           one enriched source file each
│   ├── Source                      provenance (path, folder, type)
│   ├── chunk_ids ─────────┐        links into the chunk list
│   └── references / referenced_by : Relation[]
├── chunks: Chunk[] ◀──────┘        retrievable units
│   ├── Source
│   ├── neighbors                   adjacent chunk ids
│   ├── relations: Relation[]
│   └── embedding                   vector (stage 06); held in memory, serialized to embeddings/vectors.f32, never inlined in index.json
└── relations: Relation[]           graph-level edges

KnowledgeSpace

The top-level result of processing a directory. It holds the document graph, the chunks, the embeddings, and run metadata, and it serializes to a single portable .indx archive that re-loads without re-processing.

Most important attributes:

  • root — absolute path of the walked directory or ZIP.
  • documents — the Document graph (exposed as a callable accessor; see below).
  • chunks — every Chunk in the space.
  • relations — graph-level edges (a mirror of the per-object edges).
  • metadata — tool version, resolved config snapshot, build time, and any non-fatal errors.

Beyond the data, KnowledgeSpace gives you a first-class API so you rarely touch the raw fields:

from indx import KnowledgeSpace

space = KnowledgeSpace.load("./ai-ready/handbook.indx")  # filename is <name>.indx; default name is handbook, set with --name
space.stats                           # SpaceStats: counts, embed_dim, type histogram
space.documents(type="policy")        # filter the document graph by detected type
space.search("data retention", k=5)   # semantic search → list[SearchHit]
space.save("./copy.indx")             # seal back to a portable archive

  • stats returns a SpaceStats with document/chunk/relation/embedding counts, embed_dim, and a per-type histogram.
  • documents(type=...) returns the documents, optionally filtered by detected type.
  • search(query, k=5) embeds the query and returns the top-k chunks as SearchHits, each with its resolved neighbor chunks and source.
  • load(path) / save(path) open and seal .indx archives.
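
To make the top-k step behind search concrete, here is a small standalone sketch of ranking stored vectors against a query vector. It assumes cosine similarity, which this page does not specify, and the names cosine and top_k are hypothetical helpers rather than part of the indx API. The toy 2-d vectors are made up.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: list[float], vectors: dict[str, list[float]], k: int = 5
) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to query_vec, best first."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-d embeddings keyed by chunk id (illustrative values only).
vectors = {
    "chunk_0001": [1.0, 0.0],
    "chunk_0002": [0.0, 1.0],
    "chunk_0003": [1.0, 1.0],
}
hits = top_k([1.0, 0.1], vectors, k=2)
```

In the real API each result is a SearchHit that also carries resolved neighbors and a source; the sketch covers only the ranking.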

Document

One source file, enriched. A Document is where indx records everything a file-level parser throws away (folder lineage, document type, and cross-document references) alongside LLM-derived semantics.

Most important attributes:

  • id — stable id, e.g. doc_0007.
  • path / folder — original location relative to the walked root.
  • lineage — the folder ancestry root→leaf (e.g. ["policies", "policies/data"]), so agents can filter and reason by location.
  • type — detected/enriched document type (e.g. policy, guide), which drives type-aware enrichment.
  • topics, tags, summary — semantic metadata added during Enrich.
  • chunk_ids — the chunks produced from this document, in order.
  • references / referenced_by — outgoing and incoming Relation edges resolved during Relate.
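
The lineage attribute in the list above can be pictured as the chain of folder prefixes for a file. The following is a minimal sketch under that reading; folder_lineage is a hypothetical helper, not an indx function, and it assumes POSIX-style paths relative to the walked root.

```python
from pathlib import PurePosixPath


def folder_lineage(relative_path: str) -> list[str]:
    """Return the folder ancestry of a file, root to leaf, excluding the file itself."""
    folders = PurePosixPath(relative_path).parent.parts
    return ["/".join(folders[: i + 1]) for i in range(len(folders))]


lineage = folder_lineage("policies/data/retention.md")
```

Storing every prefix, rather than only the immediate folder, lets a filter such as "anything under policies" become a simple membership check on the list.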

A Document does not embed its chunks; it links to them by id through chunk_ids. Resolve those against space.chunks to walk from a document to its retrievable content.

Chunk

The retrievable unit: the thing your RAG system or agent actually fetches. Every chunk remembers exactly where it sits, so retrieval never returns an orphaned fragment.

Most important attributes:

  • id — stable id, e.g. chunk_0481.
  • text — the retrievable text payload.
  • source — the originating Source (path, folder, type).
  • doc_id — id of the parent Document.
  • index — 0-based position within that document.
  • metadata — enriched topics, summary, and tags.
  • neighbors — adjacent chunk ids (previous and next), for expanding context windows.
  • relations — outgoing typed edges from this chunk.
  • embedding — the vector, populated in Embed+Pack. It is an in-memory field that is never inlined into the chunk JSON — vectors are serialized separately to the archive’s embeddings/vectors.f32 (see the .indx archive reference), so embedding is omitted from index.json and is None until embedding runs.
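
To illustrate what keeping vectors out of the JSON can look like, here is a sketch that round-trips vectors through a raw float32 file. The layout is an assumption made for this example (values written back to back in native byte order, one row of embed_dim values per chunk); the .indx archive reference is the authority on the real vectors.f32 format. write_vectors and read_vectors are hypothetical helpers.

```python
import os
import tempfile
from array import array


def write_vectors(path: str, vectors: list[list[float]]) -> None:
    """Write vectors back to back as raw float32 values (assumed layout)."""
    flat = array("f", (value for vec in vectors for value in vec))
    with open(path, "wb") as fh:
        flat.tofile(fh)


def read_vectors(path: str, embed_dim: int) -> list[list[float]]:
    """Read a raw float32 file and split it into rows of embed_dim values."""
    flat = array("f")
    with open(path, "rb") as fh:
        flat.frombytes(fh.read())
    if len(flat) % embed_dim:
        raise ValueError("file size is not a multiple of embed_dim")
    return [list(flat[i : i + embed_dim]) for i in range(0, len(flat), embed_dim)]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectors.f32")
    write_vectors(path, [[0.5, 0.25], [1.0, -2.0]])
    rows = read_vectors(path, embed_dim=2)
```

A flat binary file like this stays compact and loads in one read, which is one plausible reason to keep vectors out of index.json.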

Because a chunk carries doc_id, index, neighbors, and relations, an agent can climb back to the parent document, pull adjacent chunks for context, or follow a continues/references edge — instead of reasoning over a flat bag of text.
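
As a sketch of the "pull adjacent chunks for context" step, the snippet below widens a hit to include its neighbors. ChunkStub is a stand-in dataclass holding only the fields needed here, not the real Chunk model, and with_context is a hypothetical helper. Ordering by id assumes that zero-padded ids sort in document order.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkStub:
    """Stand-in for Chunk with only the fields this sketch needs."""

    id: str
    text: str
    neighbors: list[str] = field(default_factory=list)


def with_context(hit_id: str, by_id: dict[str, ChunkStub]) -> str:
    """Join a chunk's text with its neighbors' text, in chunk-id order."""
    hit = by_id[hit_id]
    ids = sorted({hit.id, *(n for n in hit.neighbors if n in by_id)})
    return "\n".join(by_id[i].text for i in ids)


chunks = [
    ChunkStub("chunk_0001", "Retention applies to all records.", ["chunk_0002"]),
    ChunkStub("chunk_0002", "Records are kept for seven years.", ["chunk_0001", "chunk_0003"]),
    ChunkStub("chunk_0003", "Exceptions require legal approval.", ["chunk_0002"]),
]
by_id = {c.id: c for c in chunks}
window = with_context("chunk_0002", by_id)
```

Building the by_id dictionary once keeps each neighbor lookup constant-time, which matters when many hits are expanded in a row.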

Relation

A typed, directed edge in the knowledge graph. A Relation may connect document→document, chunk→chunk, or chunk→document. This is how indx captures the structure that file-by-file parsing destroys.

Attributes:

  • type — one of the five RelationType values.
  • to — the target id or path (e.g. legal/gdpr.md or chunk_0482).
  • from_id — optional source id; omitted when the edge is stored on its owning object (the owner is the implicit source).
  • weight — optional confidence/similarity score in [0, 1], where applicable.

The five relation types:

RelationType    Meaning
sibling         Same folder / same logical group.
parent          Folder lineage / containment.
references      An outgoing citation, link, or mention.
continues       The next unit in a split sequence.
duplicate-of    Near or exact duplicate content.

Relations appear in three places: on a chunk’s relations, on a document’s references / referenced_by, and (as an optional mirror) on the space-level relations list. See Relate for how each type is derived.
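
The implicit-source rule is easier to see in code. This sketch models the four attributes with a stand-in dataclass (RelationStub, not the real Pydantic model) and shows one way a space-level mirror could be built by filling in from_id from the owning object. Whether indx populates from_id on mirrored edges is an assumption, and mirror is a hypothetical helper.

```python
from dataclasses import dataclass, replace
from typing import Literal, Optional

RelationType = Literal["sibling", "parent", "references", "continues", "duplicate-of"]


@dataclass(frozen=True)
class RelationStub:
    """Stand-in for Relation, mirroring the four attributes listed above."""

    type: RelationType
    to: str
    from_id: Optional[str] = None
    weight: Optional[float] = None

    def __post_init__(self) -> None:
        # Reject scores outside the documented [0, 1] range.
        if self.weight is not None and not 0.0 <= self.weight <= 1.0:
            raise ValueError("weight must be in [0, 1]")


def mirror(owned: dict[str, list[RelationStub]]) -> list[RelationStub]:
    """Flatten per-owner edges into one list, making the implicit source explicit."""
    return [
        replace(rel, from_id=rel.from_id or owner_id)
        for owner_id, rels in owned.items()
        for rel in rels
    ]


owned = {
    "chunk_0481": [RelationStub("continues", "chunk_0482")],
    "doc_0007": [RelationStub("references", "legal/gdpr.md", weight=0.9)],
}
graph_edges = mirror(owned)
```

Leaving from_id empty on owned edges avoids repeating the owner's id on every edge, while the flattened list stays self-describing.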

Source

The lightweight provenance record attached to every chunk and parsed unit. It is what makes results traceable back to disk.

Attributes:

  • path — original file path relative to the walked root.
  • folder — containing folder, relative to root.
  • type — detected/enriched document type (e.g. policy).

Source is small by design: it travels with chunks and ParsedDocs so that a SearchHit can expose hit.source.path without re-resolving the parent document.

SpaceContext

The shared, mutable carrier threaded through every pipeline stage. Where the objects above are the result, SpaceContext is the work in progress. Each stage obeys the contract run(ctx: SpaceContext) -> SpaceContext and returns the same object it received, mutated in place: it appends to the collections relevant to its phase while reading what earlier stages produced.

What it carries:

  • Inputs: root, out, and the resolved config.
  • Bound components: parser, llm, vlm, embedder, store, writer (the resolved protocol implementations).
  • Accumulated work: dir_graph (Walk), parsed (Parse), documents, chunks (Chunk onward), relations (Relate), and embeddings (Embed).
  • Diagnostics: errors, a list of non-fatal per-item failures surfaced later under space.metadata["errors"].

When the pipeline finishes, ctx.to_space() materializes the context into the final KnowledgeSpace.
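
The stage contract described above can be sketched with plain dataclasses. ContextStub carries only a few of the collections named in the list, and the Stage protocol, ParseStage, ChunkStage, and run_pipeline are hypothetical names for illustration. Modelling a stage as an object with a run method is an assumption; the text only fixes the signature.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ContextStub:
    """Stand-in for SpaceContext with a few of the collections named above."""

    root: str
    parsed: list[str] = field(default_factory=list)
    chunks: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)


class Stage(Protocol):
    def run(self, ctx: ContextStub) -> ContextStub: ...


class ParseStage:
    def run(self, ctx: ContextStub) -> ContextStub:
        # Appends to its own collection and returns the same object.
        ctx.parsed.extend(["alpha beta", ""])
        return ctx


class ChunkStage:
    def run(self, ctx: ContextStub) -> ContextStub:
        for i, text in enumerate(ctx.parsed):
            if not text:
                # Non-fatal: record the failure and keep going.
                ctx.errors.append(f"parsed[{i}]: empty document")
                continue
            ctx.chunks.extend(text.split())
        return ctx


def run_pipeline(ctx: ContextStub, stages: list[Stage]) -> ContextStub:
    """Thread one context through every stage in order."""
    for stage in stages:
        ctx = stage.run(ctx)
    return ctx


start = ContextStub(root="./handbook")
end = run_pipeline(start, [ParseStage(), ChunkStage()])
```

Since every stage returns the object it was given, end and start are the same instance, and a per-item failure ends up in errors instead of aborting the run.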

Conceptually, the models nest like this:

from indx import KnowledgeSpace, Document, Chunk, Relation, Source

space = KnowledgeSpace.load("./ai-ready/handbook.indx")
doc: Document = space.documents(type="policy")[0]
print(doc.path, doc.lineage, doc.topics)  # provenance + semantics

# Walk from a document to its chunks via chunk_ids.
chunks = [c for c in space.chunks if c.id in doc.chunk_ids]
first: Chunk = chunks[0]
print(first.text, first.neighbors)  # text + adjacent ids

# Every chunk carries a Source and typed Relations.
src: Source = first.source
rel: Relation
for rel in first.relations:
    print(rel)  # e.g. Relation(type="references", to="legal/gdpr.md")

From here, deepen your understanding with the pipeline and stages concept, the bring-your-own-stack model, or the complete data-models reference.