03 · Chunk

Stage 03 turns each ParsedDoc from Parse into an ordered list of Chunks — the retrievable units everything downstream depends on. Chunk is a built-in stage (no swappable component): it splits content while keeping structure intact, and stamps every chunk with the provenance, position, and neighbor links that let context travel with the text.

What Chunk does

Chunk reads the parsed documents accumulated in the shared SpaceContext (ctx.parsed, keyed by doc_id) and writes Chunk objects to ctx.chunks. Like every stage, it implements run(ctx: SpaceContext) -> SpaceContext and returns the same context object, mutated in place; see the pipeline overview and protocols reference.
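The stage contract described above can be sketched as follows. This is a minimal illustration, not indx's actual code: the SpaceContext here is a hypothetical stand-in with only the two attributes named in the text, the list type of ctx.chunks is an assumption, and the "split" is a placeholder that emits one chunk per document.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class SpaceContext:
    """Hypothetical stand-in holding only the two attributes named above."""
    parsed: dict[str, Any] = field(default_factory=dict)  # doc_id -> ParsedDoc
    chunks: list[Any] = field(default_factory=list)       # container type is assumed


class Stage(Protocol):
    """Shape every stage is described as following."""
    def run(self, ctx: SpaceContext) -> SpaceContext: ...


class ChunkStage:
    def run(self, ctx: SpaceContext) -> SpaceContext:
        for doc_id in sorted(ctx.parsed):
            # Placeholder split: one chunk per parsed document.
            ctx.chunks.append({"doc_id": doc_id, "text": str(ctx.parsed[doc_id])})
        return ctx  # the same object, mutated in place


ctx = SpaceContext(parsed={"doc_0007": "example parsed text"})
stage: Stage = ChunkStage()
result = stage.run(ctx)
```

Returning the same object lets stages be chained while every stage still sees what earlier stages wrote.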

For each document it:

Splits content with structure intact. A ParsedDoc carries more than flat text — it has blocks (headings, paragraphs, list items), tables, and images. Chunk uses these structural artifacts to split along natural boundaries rather than blindly slicing a character stream, so a heading stays with its section and a table is not cut mid-row.
Stamps provenance. Every chunk copies the document’s Source (path, folder, type), so a chunk always knows which file and folder it came from.
Records position. Each chunk gets a doc_id (the id of its parent Document) and a 0-based index giving its position within that document.
Links neighbors. Each chunk’s neighbors list holds its adjacent chunk ids (previous, next) within the same document. This adjacency is what establishes the implicit continues sequence used by Relate and surfaced at query time.
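The stamping steps above (provenance, position, neighbor links) can be sketched as below, assuming the text has already been split. The dataclasses are dependency-free stand-ins for the real models, stamp_chunks and its first_number parameter are hypothetical, the chunk texts are invented, and giving the first and last chunk a single neighbor is an assumption the text does not spell out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Source:
    path: str
    folder: str
    type: str


@dataclass
class Chunk:
    id: str
    text: str
    source: Source
    doc_id: str
    index: int
    neighbors: list[str] = field(default_factory=list)


def stamp_chunks(texts: list[str], source: Source, doc_id: str,
                 first_number: int) -> list[Chunk]:
    """Turn already-split texts into linked Chunks (hypothetical helper)."""
    ids = [f"chunk_{first_number + i:04d}" for i in range(len(texts))]
    chunks = []
    for i, text in enumerate(texts):
        # Previous and next ids within the same document.
        # Assumption: the first and last chunk simply have one neighbor.
        neighbors = ids[max(i - 1, 0):i] + ids[i + 1:i + 2]
        chunks.append(Chunk(ids[i], text, source, doc_id, i, neighbors))
    return chunks


chunks = stamp_chunks(
    ["Scope of the policy", "Retention periods", "Exceptions"],  # illustrative text
    Source(path="policies/data/retention.pdf", folder="policies/data", type="policy"),
    doc_id="doc_0007",
    first_number=480,
)
```

Every chunk shares the same Source instance, so provenance cannot drift between chunks of one document.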

The Chunk model

Chunk is a Pydantic v2 model. The fields populated at this stage:

Field	Type	Set by	Meaning
`id`	`str`	Chunk	Stable, zero-padded id, e.g. `chunk_0481`.
`text`	`str`	Chunk	The retrievable text payload.
`source`	`Source`	Chunk	Originating document provenance (`path`, `folder`, `type`).
`doc_id`	`str`	Chunk	Id of the parent `Document`, e.g. `doc_0007`.
`index`	`int`	Chunk	0-based position within the parent document.
`neighbors`	`list[str]`	Chunk	Adjacent chunk ids (prev, next).
`metadata`	`dict[str, Any]`	Enrich	Topics, summary, tags (added later).
`relations`	`list[Relation]`	Relate	Outgoing typed edges (added later).
`embedding`	`list[float]` or `None`	Embed+Pack	Vector (added later).

So Chunk owns id, text, source, doc_id, index, and neighbors; the remaining fields are filled by later stages. The parent Document.chunk_ids list records, in order, the chunks produced from each document. Full field definitions live in the data models reference.

Stable, deterministic chunk ids

Chunk ids are stable and zero-padded (chunk_0481, not chunk_481) and are assigned in a fixed deterministic traversal order:

folder lineage → then path → then in-document index

Because the order never depends on parallelism or filesystem iteration quirks, re-running indx over unchanged input yields identical chunk ids. This is a core part of indx’s reproducibility contract: the same inputs, config, and component versions produce a byte-identical index.json (modulo the created_at timestamp).

Parallel work earlier in the pipeline (parsing is embarrassingly parallel) is re-sorted into this deterministic order before ids are assigned, so concurrency never affects the resulting ids or their ordering.
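One way to picture this ordering is the sketch below. It is an illustration under stated assumptions rather than indx's implementation: "folder lineage" is read as the components of the parent folder, numbering is assumed to start at 0, and traversal_key and assign_chunk_ids are hypothetical names.

```python
from pathlib import PurePosixPath


def traversal_key(path: str) -> tuple[tuple[str, ...], str]:
    """Sort key: folder lineage first, then the full path (assumed encoding)."""
    return (PurePosixPath(path).parent.parts, path)


def assign_chunk_ids(chunk_counts: dict[str, int], width: int = 4) -> dict[str, list[str]]:
    """Map each document path to its chunk ids, numbered in traversal order."""
    ids: dict[str, list[str]] = {}
    counter = 0  # assumption: numbering starts at 0
    for path in sorted(chunk_counts, key=traversal_key):
        count = chunk_counts[path]
        # In-document index is the innermost ordering level.
        ids[path] = [f"chunk_{counter + i:0{width}d}" for i in range(count)]
        counter += count
    return ids


# The same documents, discovered in two different orders (e.g. by parallel parsers).
run_a = {"policies/data/retention.pdf": 2, "faq.md": 1, "policies/access.md": 3}
run_b = {"policies/access.md": 3, "policies/data/retention.pdf": 2, "faq.md": 1}
ids_a = assign_chunk_ids(run_a)
ids_b = assign_chunk_ids(run_b)
```

Because the ids depend only on the sorted keys and per-document chunk counts, the order in which documents were discovered has no effect on the result.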

Example chunk

After Chunk runs, a chunk from a retention policy looks like this. The metadata and relations fields are still empty because later stages populate them; at this point, continues adjacency exists only as neighbors:

{
  "id": "chunk_0481",
  "text": "Enterprise data is retained for 90 days…",
  "source": {
    "path": "policies/data/retention.pdf",
    "folder": "policies/data",
    "type": "policy"
  },
  "doc_id": "doc_0007",
  "index": 1,
  "neighbors": ["chunk_0480", "chunk_0482"],
  "metadata": {},
  "relations": []
}

Here chunk_0481 is the second chunk (index: 1) of doc_0007. Its neighbors point at the chunk before (chunk_0480) and after (chunk_0482) it — the raw material for the implicit continues sequence. At query time, space.search(...) returns each hit as a SearchHit whose resolved neighbors give an agent a ready-made context window around the match.
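To show why neighbor ids are enough to rebuild surrounding context, here is a small sketch that works on plain dicts shaped like the JSON above. It does not use the SearchHit API, whose fields are not described here; context_window is a hypothetical helper and the chunk texts are invented.

```python
def context_window(hit_id: str, chunks: dict[str, dict]) -> str:
    """Join a chunk with whichever of its neighbors are available, in document order."""
    hit = chunks[hit_id]
    window = [chunks[n] for n in hit["neighbors"] if n in chunks] + [hit]
    window.sort(key=lambda c: c["index"])  # index gives in-document order
    return "\n".join(c["text"] for c in window)


# Illustrative chunks from one document.
store = {
    "chunk_0480": {"text": "Scope.", "index": 0, "neighbors": ["chunk_0481"]},
    "chunk_0481": {"text": "Retention.", "index": 1,
                   "neighbors": ["chunk_0480", "chunk_0482"]},
    "chunk_0482": {"text": "Exceptions.", "index": 2, "neighbors": ["chunk_0481"]},
}
window_text = context_window("chunk_0481", store)
```

Sorting by index rather than trusting the order of the neighbors list keeps the window correct even if a neighbor is missing.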

Chunking concerns

How content is split is the decision in this stage that most affects retrieval quality. The Chunk stage has to balance several concerns:

Chunk size. Chunks must be small enough to be precise retrieval targets, but large enough to stand alone as meaningful context.
Overlap. A sliding overlap between adjacent chunks can preserve context across boundaries, at the cost of some redundancy.
Structure-awareness. Respecting headings, paragraphs, list items, and table boundaries from the ParsedDoc keeps semantically related content together rather than splitting on raw length alone.
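The three concerns above interact, and a toy splitter makes the trade-off concrete. The sketch below is not indx's algorithm: pack_blocks, the character budget, and block-level overlap are all assumptions chosen for illustration. It packs whole blocks greedily up to a size limit and carries the last block of each chunk into the next one.

```python
def pack_blocks(blocks: list[str], max_chars: int = 200, overlap: int = 1) -> list[str]:
    """Greedily pack whole blocks into chunks, carrying `overlap` trailing blocks forward."""
    chunks: list[str] = []
    current: list[str] = []
    for block in blocks:
        candidate = current + [block]
        if current and len("\n".join(candidate)) > max_chars:
            chunks.append("\n".join(current))
            # Start the next chunk with the last `overlap` blocks for continuity.
            current = current[-overlap:] if overlap else []
            candidate = current + [block]
        # A block is never cut; a single oversized block becomes its own chunk.
        current = candidate
    if current:
        chunks.append("\n".join(current))
    return chunks


blocks = [
    "# Retention",
    "Data is kept 90 days.",
    "Backups are kept 30 days.",
    "Logs are kept 7 days.",
]
with_overlap = pack_blocks(blocks, max_chars=50, overlap=1)
without_overlap = pack_blocks(blocks, max_chars=50, overlap=0)
```

With overlap enabled, the same four blocks yield more chunks, and each boundary block appears twice: the redundancy cost mentioned above, paid for context that survives the split.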

Where chunks go next

02 Parse ──▶ 03 Chunk ──▶ 04 Relate ──▶ 05 Enrich ──▶ 06 Embed+Pack
            ctx.chunks    relations     metadata       vectors + .indx

04 Relate reads chunk adjacency to materialize continues edges and resolves references, sibling, parent, and duplicate-of relations.
05 Enrich attaches topics, tags, and summary to each chunk’s metadata.
06 Embed+Pack vectorizes chunk.text, writes vectors to the store, and seals the .indx archive — with one JSON file per chunk under chunks/.