Concepts Overview

indx is built around a deliberately small mental model. Once four ideas click — KnowledgeSpace, Document, Chunk, and Relation — the rest of the project stops feeling like an API to memorize and starts feeling obvious. The CLI, the SDK, and the output format are all just different views onto these same objects.

indx turns a directory into a knowledge space: it walks a folder, parses each file with a parser you choose, and then does the part most tooling skips — it models how those files, folders, and chunks relate, and writes the result out in a layout built for retrieval.

It is worth repeating the positioning, because it shapes every concept below: indx composes parsers; it does not replace them. A file parser answers “what does this one file say?” indx answers the harder question an agent actually asks: “how does this body of knowledge fit together, and what context do I need to trust a given piece of it?”

The four Pydantic models below (KnowledgeSpace, Document, Chunk, and Relation) are the entire data model you need to understand to be productive. Field-by-field tables live in the data models reference; here we stay conceptual.

KnowledgeSpace — the top-level result of processing a directory. It holds the document graph, the chunks, the embeddings, and the build metadata, and it serializes to a single portable .indx archive. A KnowledgeSpace is a first-class object: you can inspect its stats, filter its documents(type=...), and run search(query, k=5) against it. It is what DirectoryPipeline.run(...) returns, and what KnowledgeSpace.load(path) reconstructs.

Document — one source file, enriched. Beyond the original path, a Document carries its folder lineage (root-to-leaf ancestry), a detected type, and the LLM-derived topics, tags, and summary. It also records the references it points to and the references that point back to it — the file-to-file edges most pipelines throw away at parse time.

Chunk — a retrievable unit that never forgets where it came from. Every Chunk keeps its source provenance, its parent document id, its 0-based position within that document, the ids of its immediate neighbor chunks, and any typed relations. Because that context travels with the text, retrieval returns something an agent can reason over rather than an orphaned fragment.

Relation — a typed, directed edge in the knowledge graph. A Relation can connect a document to a document, a chunk to a chunk, or a chunk to a document. The five RelationType values are the structure indx works to preserve:

| Type | Meaning |
|------|---------|
| sibling | Same folder / same logical group |
| parent | Folder lineage / containment |
| references | Outgoing citation, link, or mention |
| continues | Next unit in a split sequence |
| duplicate-of | Near or exact duplicate content |
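
To make the shape of these four models concrete, here is a minimal sketch using plain dataclasses as stand-ins. It is not indx's actual Pydantic code: the field names `lineage`, `documents`, and `chunks`, the list representation of folder lineage, and the `edges` helper are assumptions made for illustration, and the enrichment fields (topics, tags, summary) are left out for brevity.

```python
from dataclasses import dataclass, field
from enum import Enum


class RelationType(str, Enum):
    # The five relation types from the table above.
    SIBLING = "sibling"
    PARENT = "parent"
    REFERENCES = "references"
    CONTINUES = "continues"
    DUPLICATE_OF = "duplicate-of"


@dataclass
class Relation:
    type: RelationType
    to: str  # target of the directed edge (a document or chunk)


@dataclass
class Chunk:
    id: str
    doc_id: str   # parent document id
    index: int    # 0-based position within the parent document
    text: str
    neighbors: list[str] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)


@dataclass
class Document:
    id: str
    path: str
    lineage: list[str] = field(default_factory=list)  # assumed: root-to-leaf folders
    type: str = ""


@dataclass
class KnowledgeSpace:
    documents: list[Document] = field(default_factory=list)
    chunks: list[Chunk] = field(default_factory=list)

    def edges(self, rel_type: RelationType) -> list[tuple[str, str]]:
        """Hypothetical helper: (chunk id, target) pairs for one relation type."""
        return [(c.id, r.to)
                for c in self.chunks
                for r in c.relations
                if r.type == rel_type]


doc = Document(id="doc_0007", path="policies/data/retention.pdf",
               lineage=["policies", "data"], type="policy")
chunk = Chunk(id="chunk_0481", doc_id=doc.id, index=12,
              text="Enterprise data is retained for 90 days…",
              neighbors=["chunk_0480", "chunk_0482"],
              relations=[Relation(RelationType("references"), "legal/gdpr.md")])
space = KnowledgeSpace(documents=[doc], chunks=[chunk])

print(space.edges(RelationType.REFERENCES))  # [('chunk_0481', 'legal/gdpr.md')]
```

The point of the sketch is the direction of ownership: relations hang off the chunk that emits them, so a chunk pulled out of the space still carries its edges with it.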

Here is the canonical chunk as it appears in index.json. It is the clearest single illustration of why indx exists — almost every concept above shows up in one object.

    {
      "id": "chunk_0481",
      "doc_id": "doc_0007",
      "index": 12,
      "text": "Enterprise data is retained for 90 days…",
      "source": {
        "path": "policies/data/retention.pdf",
        "folder": "policies/data",
        "type": "policy"
      },
      "metadata": {
        "topics": ["retention", "compliance"],
        "summary": "90-day retention rule…",
        "tags": ["data-retention", "gdpr"]
      },
      "neighbors": ["chunk_0480", "chunk_0482"],
      "relations": [
        { "type": "references", "to": "legal/gdpr.md" }
      ]
    }

Walking through what each field shows:

  • id — a stable, zero-padded identifier (chunk_0481). Ids are assigned in a deterministic traversal order, so rerunning over unchanged input yields identical ids. That is what makes a knowledge space diffable and reproducible.
  • doc_id and index — the parent document id (doc_0007) and the chunk’s 0-based position within that document (12). Together they pin every chunk back to its source document and its place in the original reading order.
  • text — the retrievable payload. This is the part a file-level parser would give you on its own. Everything around it is what indx adds.
  • source — provenance. The chunk knows its original path, its containing folder, and the detected document type. An agent can filter by location (“only chunks under policies/”) or by type (“only policy documents”) without guessing.
  • metadata — the semantic layer added during Enrich: topics, summary, and tags. This is what lets retrieval reason instead of pattern-matching raw text.
  • neighbors — the ids of the adjacent chunks (chunk_0480, chunk_0482). Context travels with the chunk, so an agent can expand the window around a hit instead of retrieving it in isolation.
  • relations — typed edges leaving this chunk. Here a references edge points to legal/gdpr.md. This is the structure most pipelines discard — and the reason an agent can follow knowledge rather than just match it.
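
The fields walked through above can be exercised with nothing but the standard library. The sketch below assumes index.json can be read as a list of chunk records shaped like the example; that top-level layout is an assumption, as are the three extra records (`chunk_0480`, `chunk_0482`, `chunk_0900`), their placeholder text, and the helper names `under_folder`, `expand_window`, and `outgoing`.

```python
import json

# A trimmed copy of the example chunk plus three hypothetical records.
CHUNKS_JSON = """
[
  {"id": "chunk_0480", "doc_id": "doc_0007", "index": 11,
   "text": "(placeholder: preceding chunk)",
   "source": {"path": "policies/data/retention.pdf",
              "folder": "policies/data", "type": "policy"},
   "neighbors": ["chunk_0479", "chunk_0481"], "relations": []},
  {"id": "chunk_0481", "doc_id": "doc_0007", "index": 12,
   "text": "Enterprise data is retained for 90 days…",
   "source": {"path": "policies/data/retention.pdf",
              "folder": "policies/data", "type": "policy"},
   "neighbors": ["chunk_0480", "chunk_0482"],
   "relations": [{"type": "references", "to": "legal/gdpr.md"}]},
  {"id": "chunk_0482", "doc_id": "doc_0007", "index": 13,
   "text": "(placeholder: following chunk)",
   "source": {"path": "policies/data/retention.pdf",
              "folder": "policies/data", "type": "policy"},
   "neighbors": ["chunk_0481"], "relations": []},
  {"id": "chunk_0900", "doc_id": "doc_0012", "index": 0,
   "text": "(placeholder: chunk from another folder)",
   "source": {"path": "legal/gdpr.md", "folder": "legal", "type": "legal"},
   "neighbors": [], "relations": []}
]
"""

chunks = json.loads(CHUNKS_JSON)
by_id = {c["id"]: c for c in chunks}


def under_folder(chunks, prefix):
    """Chunks whose source folder is `prefix` or nested below it."""
    prefix = prefix.rstrip("/")
    return [c for c in chunks
            if c["source"]["folder"] == prefix
            or c["source"]["folder"].startswith(prefix + "/")]


def expand_window(hit, by_id):
    """The hit plus whichever of its neighbors are loaded, in reading order."""
    window = [hit] + [by_id[n] for n in hit["neighbors"] if n in by_id]
    return sorted(window, key=lambda c: c["index"])


def outgoing(hit, rel_type):
    """Targets of the typed edges of one kind leaving this chunk."""
    return [r["to"] for r in hit["relations"] if r["type"] == rel_type]


hit = by_id["chunk_0481"]
print([c["id"] for c in under_folder(chunks, "policies")])
# ['chunk_0480', 'chunk_0481', 'chunk_0482']
print([c["id"] for c in expand_window(hit, by_id)])
# ['chunk_0480', 'chunk_0481', 'chunk_0482']
print(outgoing(hit, "references"))
# ['legal/gdpr.md']
```

None of these helpers needs to reopen the source files: location filtering, window expansion, and edge following all work from what the chunk record already carries.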

Under the hood, a DirectoryPipeline is just an ordered list of six stages with a single shared object flowing through them. That object is the SpaceContext: every stage obeys the contract run(ctx: SpaceContext) -> SpaceContext and returns the same object it received, mutated in place. Walk appends the directory graph, Parse appends ParsedDocs, Chunk appends Chunks, Relate appends Relations, Enrich fills in metadata, and Embed+Pack adds vectors and seals the archive.

The six ordered, replaceable stages are:

01 Walk → 02 Parse → 03 Chunk → 04 Relate → 05 Enrich → 06 Embed+Pack

Because the pipeline is a list you control rather than a black box, you can insert a custom stage (say, a redaction pass before enrichment) or drop one entirely (skip Enrich when no LLM is available). The full walkthrough lives in Pipeline and stages and the pipeline overview.
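
The stage contract and the insert/drop idea can be sketched without indx installed. Everything below is a stand-in written for illustration: the `SpaceContext` fields (`texts`, `trace`), the `Noop` placeholder stages, the `Redact` stage, and `run_pipeline` are all hypothetical, and the real DirectoryPipeline may expose stage editing differently. Only the `run(ctx) -> ctx` contract and the six stage names come from the text above.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SpaceContext:
    """Stand-in for the shared object; the real one carries far more state."""
    texts: list[str] = field(default_factory=list)  # hypothetical field
    trace: list[str] = field(default_factory=list)  # hypothetical: stages that ran


class Noop:
    """Placeholder for a built-in stage: records that it ran, nothing else."""

    def __init__(self, name: str) -> None:
        self.name = name

    def run(self, ctx: SpaceContext) -> SpaceContext:
        ctx.trace.append(self.name)
        return ctx  # same object, mutated in place


class Redact:
    """Hypothetical custom stage: mask email addresses before enrichment."""

    name = "Redact"
    _EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def run(self, ctx: SpaceContext) -> SpaceContext:
        ctx.texts[:] = [self._EMAIL.sub("[redacted]", t) for t in ctx.texts]
        ctx.trace.append(self.name)
        return ctx


def run_pipeline(stages, ctx: SpaceContext) -> SpaceContext:
    """Thread one context through the stages in order."""
    for stage in stages:
        ctx = stage.run(ctx)
    return ctx


stages = [Noop(n) for n in
          ("Walk", "Parse", "Chunk", "Relate", "Enrich", "Embed+Pack")]

# Insert a custom stage immediately before Enrich.
with_redaction = list(stages)
enrich_at = [s.name for s in with_redaction].index("Enrich")
with_redaction.insert(enrich_at, Redact())

# Drop Enrich entirely, for example when no LLM is available.
without_enrich = [s for s in stages if s.name != "Enrich"]

ctx = SpaceContext(texts=["Contact ops@example.com for retention questions."])
out = run_pipeline(with_redaction, ctx)

print(out is ctx)   # True
print(out.trace)
# ['Walk', 'Parse', 'Chunk', 'Relate', 'Redact', 'Enrich', 'Embed+Pack']
print(out.texts)    # ['Contact [redacted] for retention questions.']
```

Because every stage takes and returns the same context, reordering or removing a stage is ordinary list manipulation; the only thing to watch is that a later stage does not depend on data a dropped stage would have written.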

If you would rather see the model in action first, the quickstart builds and queries a real knowledge space in under a minute, and your first knowledge space walks through the output directory file by file.