03 · Chunk

Stage 03 turns each ParsedDoc from Parse into an ordered list of Chunks — the retrievable units everything downstream depends on. Chunk is a built-in stage (no swappable component): it splits content while keeping structure intact, and stamps every chunk with the provenance, position, and neighbor links that let context travel with the text.

Chunk reads the parsed documents accumulated in the shared SpaceContext (ctx.parsed, keyed by doc_id) and writes Chunk objects to ctx.chunks. Like every stage it obeys run(ctx: SpaceContext) -> SpaceContext and returns the same mutated context — see the pipeline overview and protocols reference.
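The stage contract described above can be sketched roughly as follows. This is an illustration only: the `SpaceContext` stand-in models just the two attributes named here, `ctx.chunks` is assumed to be a list, and the one-chunk-per-document "split" is a placeholder rather than the real splitting logic.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class SpaceContext:
    # Hypothetical stand-in: only the attributes mentioned in the text.
    parsed: dict[str, dict] = field(default_factory=dict)  # doc_id -> parsed doc
    chunks: list[dict] = field(default_factory=list)       # assumed to be a list


class Stage(Protocol):
    """The shape every stage is described as obeying."""

    def run(self, ctx: SpaceContext) -> SpaceContext: ...


class ChunkStage:
    def run(self, ctx: SpaceContext) -> SpaceContext:
        # Iterate in sorted doc_id order so output never depends on dict order.
        for doc_id in sorted(ctx.parsed):
            # Placeholder split: a single chunk per parsed document.
            ctx.chunks.append({"doc_id": doc_id, "text": ctx.parsed[doc_id]["text"]})
        return ctx  # the same context object, mutated in place
```

Returning the same object lets stages be chained without copying the accumulated state.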

For each document it:

  1. Splits content with structure intact. A ParsedDoc carries more than flat text — it has blocks (headings, paragraphs, list items), tables, and images. Chunk uses these structural artifacts to split along natural boundaries rather than blindly slicing a character stream, so a heading stays with its section and a table is not cut mid-row.
  2. Stamps provenance. Every chunk copies the document’s Source (path, folder, type), so a chunk always knows which file and folder it came from.
  3. Records position. Each chunk gets a doc_id (its parent Document) and a 0-based index — its position within that document.
  4. Links neighbors. Each chunk’s neighbors list holds its adjacent chunk ids (previous, next) within the same document. This adjacency is what establishes the implicit continues sequence used by Relate and surfaced at query time.
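
Steps 2 to 4 can be sketched in a few lines. The sketch uses standard-library dataclasses as stand-ins for the real Pydantic models, and `build_chunks` with its `first_number` parameter is a hypothetical helper. It also assumes that the first and last chunk of a document simply carry one neighbor, which the text above does not state.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Source:
    path: str
    folder: str
    type: str


@dataclass
class Chunk:
    id: str
    text: str
    source: Source
    doc_id: str
    index: int
    neighbors: list[str] = field(default_factory=list)


def build_chunks(pieces, source, doc_id, first_number):
    """Turn already-split text pieces of one document into linked chunks."""
    ids = [f"chunk_{first_number + i:04d}" for i in range(len(pieces))]
    chunks = []
    for i, text in enumerate(pieces):
        neighbors = []
        if i > 0:
            neighbors.append(ids[i - 1])   # previous chunk in the same document
        if i + 1 < len(ids):
            neighbors.append(ids[i + 1])   # next chunk in the same document
        # Provenance (source), position (doc_id, index) and links are stamped together.
        chunks.append(Chunk(ids[i], text, source, doc_id, i, neighbors))
    return chunks
```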

Chunk is a Pydantic v2 model. The fields populated at this stage:

| Field | Type | Set by | Meaning |
| --- | --- | --- | --- |
| id | str | Chunk | Stable, zero-padded id, e.g. chunk_0481. |
| text | str | Chunk | The retrievable text payload. |
| source | Source | Chunk | Originating document provenance (path, folder, type). |
| doc_id | str | Chunk | Id of the parent Document, e.g. doc_0007. |
| index | int | Chunk | 0-based position within the parent document. |
| neighbors | list[str] | Chunk | Adjacent chunk ids (prev, next). |
| metadata | dict[str, Any] | Enrich | Topics, summary, tags (added later). |
| relations | list[Relation] | Relate | Outgoing typed edges (added later). |
| embedding | list[float] or None | Embed+Pack | Vector (added later). |

So Chunk owns id, text, source, doc_id, index, and neighbors; the remaining fields are filled by later stages. The parent Document.chunk_ids list records, in order, the chunks produced from each document. Full field definitions live in the data models reference.

Chunk ids are stable and zero-padded (chunk_0481, not chunk_481) and are assigned in a fixed deterministic traversal order:

folder lineage → then path → then in-document index

Because the order never depends on parallelism or filesystem iteration quirks, re-running indx over unchanged input yields identical chunk ids. This is a core part of indx’s reproducibility contract: the same inputs, config, and component versions produce an index.json that is byte-identical apart from the created_at timestamp.

Parallel work earlier in the pipeline (parsing is embarrassingly parallel) is re-sorted into this deterministic order before ids are assigned, so concurrency never affects the resulting ids or their ordering.
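
A minimal sketch of that re-sort, under stated assumptions: `assign_chunk_ids` is a hypothetical helper, "folder lineage" is interpreted as the tuple of folder components in the path, numbering starts at 0, and the four-digit width simply matches the chunk_0481 example.

```python
def assign_chunk_ids(pieces):
    """Map (path, index_in_doc) pairs, given in any order, to stable chunk ids."""

    def sort_key(piece):
        path, index = piece
        folder_lineage = tuple(path.split("/")[:-1])  # assumed meaning of "lineage"
        # folder lineage, then path, then in-document index
        return (folder_lineage, path, index)

    ordered = sorted(pieces, key=sort_key)
    # Zero-padded ids follow the sorted position, not the arrival order.
    return {piece: f"chunk_{n:04d}" for n, piece in enumerate(ordered)}
```

Because the ids are derived from the sorted order alone, the order in which parallel parse workers finish has no influence on the result.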

After Chunk runs, a chunk for a retention policy looks like this. Its metadata and relations are still empty because later stages populate them; at this point the continues adjacency exists only as neighbors:

{
  "id": "chunk_0481",
  "text": "Enterprise data is retained for 90 days…",
  "source": {
    "path": "policies/data/retention.pdf",
    "folder": "policies/data",
    "type": "policy"
  },
  "doc_id": "doc_0007",
  "index": 1,
  "neighbors": ["chunk_0480", "chunk_0482"],
  "metadata": {},
  "relations": []
}

Here chunk_0481 is the second chunk (index: 1) of doc_0007. Its neighbors point at the chunk before (chunk_0480) and after (chunk_0482) it — the raw material for the implicit continues sequence. At query time, space.search(...) returns each hit as a SearchHit whose resolved neighbors give an agent a ready-made context window around the match.
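
To make the idea of a context window concrete, the following sketch resolves a chunk's neighbors by hand from a plain dict of chunk records. It is not the space.search API: `context_window` is a hypothetical helper, and the neighbor texts in the usage example are invented placeholders.

```python
def context_window(hit_id, chunks_by_id):
    """Join a chunk with its resolvable neighbors, in document order."""
    hit = chunks_by_id[hit_id]
    window = [hit] + [chunks_by_id[n] for n in hit["neighbors"] if n in chunks_by_id]
    # Neighbors belong to the same document, so index gives reading order.
    window.sort(key=lambda c: c["index"])
    return "\n".join(c["text"] for c in window)


chunks_by_id = {
    "chunk_0480": {"index": 0, "neighbors": ["chunk_0481"], "text": "(previous text)"},
    "chunk_0481": {"index": 1, "neighbors": ["chunk_0480", "chunk_0482"], "text": "(matched text)"},
    "chunk_0482": {"index": 2, "neighbors": ["chunk_0481"], "text": "(next text)"},
}
window_text = context_window("chunk_0481", chunks_by_id)
```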

How content is split is the part that most affects retrieval quality. The Chunk stage has to balance several concerns:

  • Chunk size. Chunks must be small enough to be precise retrieval targets, but large enough to stand alone as meaningful context.
  • Overlap. A sliding overlap between adjacent chunks can preserve context across boundaries, at the cost of some redundancy.
  • Structure-awareness. Respecting headings, paragraphs, list items, and table boundaries from the ParsedDoc keeps semantically related content together rather than splitting on raw length alone.
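
One way these three concerns can interact is sketched below. It is a greedy packer written for illustration, not the splitter indx actually uses: `split_blocks`, `max_chars`, and `overlap_blocks` are hypothetical names, blocks are modelled as `(kind, text)` pairs, and overlap is applied per block rather than per character. A table would be passed as a single block so it is never cut mid-row.

```python
def split_blocks(blocks, max_chars=200, overlap_blocks=1):
    """Pack (kind, text) blocks into chunks along structural boundaries."""
    chunks, current = [], []

    def size(items):
        return sum(len(text) for _, text in items)

    for block in blocks:
        kind, text = block
        has_body = any(k != "heading" for k, _ in current)
        if kind == "heading" and current:
            # A heading opens a new section: close the chunk, no overlap across sections.
            chunks.append(current)
            current = []
        elif has_body and size(current) + len(text) > max_chars:
            # Size limit reached: close the chunk and carry trailing blocks as overlap.
            chunks.append(current)
            current = current[-overlap_blocks:] if overlap_blocks else []
        # A heading with no body yet is never split off from its first block.
        current.append(block)
    if current:
        chunks.append(current)
    return ["\n".join(text for _, text in chunk) for chunk in chunks]
```

A single block longer than `max_chars` is kept whole in this sketch; a real splitter would need a fallback for that case.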

02 Parse ──▶ 03 Chunk ──▶ 04 Relate ──▶ 05 Enrich ──▶ 06 Embed+Pack
             ctx.chunks   relations     metadata      vectors + .indx

  • 04 Relate reads chunk adjacency to materialize continues edges and resolves references, sibling, parent, and duplicate-of relations.
  • 05 Enrich attaches topics, tags, and summary to each chunk’s metadata.
  • 06 Embed+Pack vectorizes chunk.text, writes vectors to the store, and seals the .indx archive — with one JSON file per chunk under chunks/.
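
The "one JSON file per chunk" layout can be pictured with a short sketch. The file name pattern `chunks/<id>.json`, the `write_chunk_files` helper, and the use of sorted keys are assumptions made for illustration; the actual archive format is defined by Embed+Pack.

```python
import json
from pathlib import Path


def write_chunk_files(chunks, root):
    """Write each chunk dict to <root>/chunks/<id>.json and return the paths."""
    out_dir = Path(root) / "chunks"
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for chunk in chunks:
        path = out_dir / f"{chunk['id']}.json"
        # sort_keys keeps the serialized bytes stable across runs.
        path.write_text(json.dumps(chunk, sort_keys=True, indent=2), encoding="utf-8")
        written.append(path)
    return written
```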