Concepts Overview
indx is built around a deliberately small mental model. Once four ideas click — KnowledgeSpace, Document, Chunk, and Relation — the rest of the project stops feeling like an API to memorize and starts feeling obvious. The CLI, the SDK, and the output format are all just different views onto these same objects.
The whole idea in one sentence
indx turns a directory into a knowledge space: it walks a folder, parses each file with a parser you choose, and then does the part most tooling skips, which is to model how those files, folders, and chunks relate and write the result out in a layout built for retrieval.
It is worth repeating the positioning, because it shapes every concept below: indx composes parsers; it does not replace them. A file parser answers “what does this one file say?” indx answers the harder question an agent actually asks: “how does this body of knowledge fit together, and what context do I need to trust a given piece of it?”
The four core objects
These four Pydantic models are the entire data model you need to understand to be productive. Field-by-field tables live in the data models reference; here we stay conceptual.
KnowledgeSpace — the top-level result of processing a directory. It holds the document graph, the chunks, the embeddings, and the build metadata, and it serializes to a single portable .indx archive. A KnowledgeSpace is a first-class object: you can inspect its stats, filter its documents(type=...), and run search(query, k=5) against it. It is what DirectoryPipeline.run(...) returns, and what KnowledgeSpace.load(path) reconstructs.
Document — one source file, enriched. Beyond the original path, a Document carries its folder lineage (root-to-leaf ancestry), a detected type, and the LLM-derived topics, tags, and summary. It also records the references it points to and the references that point back to it — the file-to-file edges most pipelines throw away at parse time.
Chunk — a retrievable unit that never forgets where it came from. Every Chunk keeps its source provenance, its parent document id, its 0-based position within that document, the ids of its immediate neighbor chunks, and any typed relations. Because that context travels with the text, retrieval returns something an agent can reason over rather than an orphaned fragment.
Relation — a typed, directed edge in the knowledge graph. A Relation can connect a document to a document, a chunk to a chunk, or a chunk to a document. The five RelationType values are the structure indx works to preserve:
| Type | Meaning |
|------|---------|
| sibling | Same folder / same logical group |
| parent | Folder lineage / containment |
| references | Outgoing citation, link, or mention |
| continues | Next unit in a split sequence |
| duplicate-of | Near or exact duplicate content |
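As a rough illustration of the table above, a typed, directed edge can be modeled in a few lines. This is a minimal sketch built on the standard library rather than indx's actual Pydantic models: the field names `source`, `target`, and `type` are assumptions made for illustration, and only the five relation type strings are taken from the table.

```python
from dataclasses import dataclass
from enum import Enum


class RelationType(str, Enum):
    """The five relation types from the table, with the string values as listed."""
    SIBLING = "sibling"
    PARENT = "parent"
    REFERENCES = "references"
    CONTINUES = "continues"
    DUPLICATE_OF = "duplicate-of"


@dataclass(frozen=True)
class Relation:
    """Hypothetical stand-in for a typed, directed edge (not indx's real model)."""
    source: str          # id of the document or chunk the edge leaves
    target: str          # id or path of the document or chunk it points to
    type: RelationType


# A chunk-to-document edge, one of the endpoint combinations described above.
edge = Relation(source="chunk_0481", target="legal/gdpr.md", type=RelationType.REFERENCES)

# Direction matters: swapping the endpoints yields a different relation.
reverse = Relation(source=edge.target, target=edge.source, type=edge.type)
```

Keeping the type as an enum rather than a free-form string means a misspelled relation type fails at construction time instead of silently creating a sixth kind of edge.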
A chunk that remembers everything
Here is the canonical chunk as it appears in index.json. It is the clearest single illustration of why indx exists: almost every concept above shows up in one object.
```json
{
  "id": "chunk_0481",
  "doc_id": "doc_0007",
  "index": 12,
  "text": "Enterprise data is retained for 90 days…",
  "source": {
    "path": "policies/data/retention.pdf",
    "folder": "policies/data",
    "type": "policy"
  },
  "metadata": {
    "topics": ["retention", "compliance"],
    "summary": "90-day retention rule…",
    "tags": ["data-retention", "gdpr"]
  },
  "neighbors": ["chunk_0480", "chunk_0482"],
  "relations": [
    { "type": "references", "to": "legal/gdpr.md" }
  ]
}
```

Walking through what each field shows:
- `id`: a stable, zero-padded identifier (`chunk_0481`). Ids are assigned in a deterministic traversal order, so rerunning over unchanged input yields identical ids. That is what makes a knowledge space diffable and reproducible.
- `doc_id` and `index`: the parent document id (`doc_0007`) and the chunk’s 0-based position within that document (`12`). Together they pin every chunk back to its source document and its place in the original reading order.
- `text`: the retrievable payload. This is the part a file-level parser would give you on its own. Everything around it is what indx adds.
- `source`: provenance. The chunk knows its original `path`, its containing `folder`, and the detected document `type`. An agent can filter by location (“only chunks under `policies/`”) or by type (“only `policy` documents”) without guessing.
- `metadata`: the semantic layer added during Enrich, namely `topics`, `summary`, and `tags`. This is what lets retrieval reason instead of pattern-matching raw text.
- `neighbors`: the ids of the adjacent chunks (`chunk_0480`, `chunk_0482`). Context travels with the chunk, so an agent can expand the window around a hit instead of retrieving it in isolation.
- `relations`: typed edges leaving this chunk. Here a `references` edge points to `legal/gdpr.md`. This is the structure most pipelines discard, and the reason an agent can follow knowledge rather than just match it.
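To make the walkthrough concrete, here is a small sketch that reads a chunk shaped like the one above with Python's standard `json` module and uses its provenance, neighbors, and relations. It works on the raw JSON rather than the indx SDK, and the helper names `in_folder` and `context_window` are hypothetical.

```python
import json

# The chunk from index.json above (text and summary are elided in the source).
raw = """
{
  "id": "chunk_0481",
  "doc_id": "doc_0007",
  "index": 12,
  "text": "Enterprise data is retained for 90 days…",
  "source": {"path": "policies/data/retention.pdf", "folder": "policies/data", "type": "policy"},
  "metadata": {"topics": ["retention", "compliance"], "summary": "90-day retention rule…",
               "tags": ["data-retention", "gdpr"]},
  "neighbors": ["chunk_0480", "chunk_0482"],
  "relations": [{"type": "references", "to": "legal/gdpr.md"}]
}
"""
chunk = json.loads(raw)


def in_folder(chunk: dict, prefix: str) -> bool:
    """Hypothetical helper: True if the chunk's folder is `prefix` or lies under it."""
    folder = chunk["source"]["folder"]
    return folder == prefix or folder.startswith(prefix.rstrip("/") + "/")


def context_window(chunk: dict) -> list[str]:
    """Hypothetical helper: the hit plus its neighbors, sorted by id.

    Zero-padded ids of equal width sort lexicographically in numeric order.
    """
    return sorted([chunk["id"], *chunk["neighbors"]])


# Outgoing "references" edges an agent could follow from this chunk.
referenced = [r["to"] for r in chunk["relations"] if r["type"] == "references"]
```

The folder check compares whole path segments, so a prefix of `policies` matches `policies/data` but would not match a sibling folder such as `policies-archive`.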
One shared context, a pipeline of stages
Under the hood, a DirectoryPipeline is just an ordered list of six stages with a single shared object flowing through them. That object is the SpaceContext: every stage obeys the contract run(ctx: SpaceContext) -> SpaceContext and returns the same object it received, mutated in place. Walk appends the directory graph, Parse appends ParsedDocs, Chunk appends Chunks, Relate appends Relations, Enrich fills in metadata, and Embed+Pack adds vectors and seals the archive.
The six ordered, replaceable stages are:
01 Walk → 02 Parse → 03 Chunk → 04 Relate → 05 Enrich → 06 Embed+Pack
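The stage contract can be sketched without the library. The following is a minimal illustration, not indx's implementation: `SpaceContext` here is a hypothetical dataclass with made-up fields, stages are plain functions rather than objects with a `run` method, and the bodies are placeholders for Walk, Parse, and Chunk only. What it shows is the shape of the contract, where each stage takes the context, mutates it in place, and returns the same object.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SpaceContext:
    """Hypothetical stand-in for the shared context; field names are assumptions."""
    files: list[str] = field(default_factory=list)
    parsed: dict[str, str] = field(default_factory=dict)
    chunks: list[str] = field(default_factory=list)


# A stage is anything that takes the context and hands the same context back.
Stage = Callable[[SpaceContext], SpaceContext]


def walk(ctx: SpaceContext) -> SpaceContext:
    ctx.files.extend(["a.md", "b.md"])          # placeholder for a directory walk
    return ctx


def parse(ctx: SpaceContext) -> SpaceContext:
    for path in ctx.files:
        ctx.parsed[path] = f"text of {path}"    # placeholder for a real parser
    return ctx


def chunk(ctx: SpaceContext) -> SpaceContext:
    ctx.chunks.extend(ctx.parsed.values())      # placeholder: one chunk per document
    return ctx


def run_pipeline(stages: list[Stage], ctx: SpaceContext) -> SpaceContext:
    """Run stages in order, threading the same context object through each."""
    for stage in stages:
        ctx = stage(ctx)
    return ctx


start = SpaceContext()
result = run_pipeline([walk, parse, chunk], start)
```

Because every stage returns the object it was given, `result` and `start` are the same instance, and a later stage can read anything an earlier stage appended.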
Because the pipeline is a list you control rather than a black box, you can insert a custom stage (say, a redaction pass before enrichment) or drop one entirely (skip Enrich when no LLM is available). The full walkthrough lives in Pipeline and stages and the pipeline overview.
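Since the stages form an ordered list, customizing a pipeline comes down to ordinary list operations. The sketch below uses plain strings as stand-ins for stage objects, and `Redact` is a hypothetical custom stage; how indx actually exposes its stage list is not shown here, so treat the helper names as illustrative.

```python
# Stage names as listed above; strings stand in for real stage objects.
DEFAULT_STAGES = ["Walk", "Parse", "Chunk", "Relate", "Enrich", "Embed+Pack"]


def insert_before(stages: list[str], anchor: str, new_stage: str) -> list[str]:
    """Return a copy of `stages` with `new_stage` placed just before `anchor`."""
    i = stages.index(anchor)          # raises ValueError if the anchor is missing
    return stages[:i] + [new_stage] + stages[i:]


def drop(stages: list[str], name: str) -> list[str]:
    """Return a copy of `stages` without the stage called `name`."""
    return [s for s in stages if s != name]


# A redaction pass before enrichment.
with_redaction = insert_before(DEFAULT_STAGES, "Enrich", "Redact")

# No LLM available: skip Enrich entirely.
without_enrich = drop(DEFAULT_STAGES, "Enrich")
```

Both helpers return new lists, so the default ordering stays untouched and several pipeline variants can be derived from it side by side.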
Where to go next
If you would rather see the model in action first, the quickstart builds and queries a real knowledge space in under a minute, and your first knowledge space walks through the output directory file by file.