03 · Chunk
Stage 03 turns each ParsedDoc from Parse into an ordered list of Chunks — the retrievable units everything downstream depends on. Chunk is a built-in stage (no swappable component): it splits content while keeping structure intact, and stamps every chunk with the provenance, position, and neighbor links that let context travel with the text.
What Chunk does
Chunk reads the parsed documents accumulated in the shared SpaceContext (ctx.parsed, keyed by doc_id) and writes Chunk objects to ctx.chunks. Like every stage, it obeys run(ctx: SpaceContext) -> SpaceContext and returns the same mutated context; see the pipeline overview and protocols reference.
For each document it:
- Splits content with structure intact. A ParsedDoc carries more than flat text: it has blocks (headings, paragraphs, list items), tables, and images. Chunk uses these structural artifacts to split along natural boundaries rather than blindly slicing a character stream, so a heading stays with its section and a table is not cut mid-row.
- Stamps provenance. Every chunk copies the document’s Source (path, folder, type), so a chunk always knows which file and folder it came from.
- Records position. Each chunk gets a doc_id (its parent Document) and a 0-based index, its position within that document.
- Links neighbors. Each chunk’s neighbors list holds its adjacent chunk ids (previous, next) within the same document. This adjacency is what establishes the implicit continues sequence used by Relate and surfaced at query time.
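
The stamping steps above can be sketched as follows. This is a minimal illustration, not indx's implementation: it uses plain dataclasses as a stand-in for the real Pydantic models, the helper name stamp_chunks and its first_id parameter are hypothetical, and it assumes the first and last chunk of a document simply have one neighbor each.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Source:
    path: str
    folder: str
    type: str


@dataclass
class Chunk:
    id: str
    text: str
    source: Source
    doc_id: str
    index: int
    neighbors: list[str] = field(default_factory=list)


def stamp_chunks(doc_id: str, source: Source, pieces: list[str], first_id: int) -> list[Chunk]:
    """Turn already-split text pieces into linked chunks for one document (hypothetical helper)."""
    ids = [f"chunk_{first_id + i:04d}" for i in range(len(pieces))]
    chunks = []
    for i, text in enumerate(pieces):
        # Adjacent ids within the same document: previous first, then next.
        neighbors = ids[max(i - 1, 0):i] + ids[i + 1:i + 2]
        chunks.append(Chunk(ids[i], text, source, doc_id, i, neighbors))
    return chunks


# Placeholder pieces; the id offset 480 is chosen only to mirror the ids used on this page.
src = Source(path="policies/data/retention.pdf", folder="policies/data", type="policy")
chunks = stamp_chunks("doc_0007", src, ["first piece", "second piece", "third piece"], first_id=480)
print(chunks[1].id, chunks[1].index, chunks[1].neighbors)
```

Every chunk shares the same Source instance here, which is safe because the stand-in Source is frozen.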
The Chunk model
Chunk is a Pydantic v2 model. Its fields, and the stage that sets each one:
| Field | Type | Set by | Meaning |
|---|---|---|---|
| id | str | Chunk | Stable, zero-padded id, e.g. chunk_0481. |
| text | str | Chunk | The retrievable text payload. |
| source | Source | Chunk | Originating document provenance (path, folder, type). |
| doc_id | str | Chunk | Id of the parent Document, e.g. doc_0007. |
| index | int | Chunk | 0-based position within the parent document. |
| neighbors | list[str] | Chunk | Adjacent chunk ids (prev, next). |
| metadata | dict[str, Any] | Enrich | Topics, summary, tags (added later). |
| relations | list[Relation] | Relate | Outgoing typed edges (added later). |
| embedding | list[float] or None | Embed+Pack | Vector (added later). |
So Chunk owns id, text, source, doc_id, index, and neighbors; the remaining fields are filled by later stages. The parent Document.chunk_ids list records, in order, the chunks produced from each document. Full field definitions live in the data models reference.
Stable, deterministic chunk ids
Chunk ids are stable and zero-padded (chunk_0481, not chunk_481) and are assigned in a fixed deterministic traversal order:
folder lineage → then path → then in-document index
Because the order never depends on parallelism or filesystem iteration quirks, re-running indx over unchanged input yields identical chunk ids. This is a core part of indx’s reproducibility contract: the same inputs, config, and component versions produce a byte-identical index.json (modulo the created_at timestamp).
Parallel work earlier in the pipeline (parsing is embarrassingly parallel) is re-sorted into this deterministic order before ids are assigned, so concurrency never affects the resulting ids or their ordering.
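
One way to picture this ordering is the sketch below. It is an illustration under stated assumptions rather than indx's actual code: the function name assign_chunk_ids is hypothetical, "folder lineage" is interpreted as the tuple of folder names leading to the file, and numbering is assumed to start at 0.

```python
from pathlib import PurePosixPath


def assign_chunk_ids(pieces_by_path: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each document path to its chunk ids, numbered in a fixed order (hypothetical helper)."""

    def sort_key(path: str):
        # Folder lineage first (as a tuple of folder names), then the full path.
        return (PurePosixPath(path).parent.parts, path)

    ids_by_path = {}
    counter = 0  # Assumption: ids start at chunk_0000.
    for path in sorted(pieces_by_path, key=sort_key):
        n = len(pieces_by_path[path])
        # In-document index is the final ordering level.
        ids_by_path[path] = [f"chunk_{counter + i:04d}" for i in range(n)]
        counter += n
    return ids_by_path


# The same documents, arriving in two different orders (as parallel parsing might produce).
run_a = assign_chunk_ids({"b/x.md": ["p1", "p2"], "a/y.md": ["p1"]})
run_b = assign_chunk_ids({"a/y.md": ["p1"], "b/x.md": ["p1", "p2"]})
print(run_a == run_b)
```

Because ids are assigned only after sorting, the arrival order of parsed documents has no influence on the result.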
Example chunk
After Chunk runs, a chunk for a retention policy looks like this (metadata and relations are still empty because later stages populate them; continues adjacency is captured first as neighbors):
```json
{
  "id": "chunk_0481",
  "text": "Enterprise data is retained for 90 days…",
  "source": {
    "path": "policies/data/retention.pdf",
    "folder": "policies/data",
    "type": "policy"
  },
  "doc_id": "doc_0007",
  "index": 1,
  "neighbors": ["chunk_0480", "chunk_0482"],
  "metadata": {},
  "relations": []
}
```

Here chunk_0481 is the second chunk (index: 1) of doc_0007. Its neighbors point at the chunk before it (chunk_0480) and the chunk after it (chunk_0482), the raw material for the implicit continues sequence. At query time, space.search(...) returns each hit as a SearchHit whose resolved neighbors give an agent a ready-made context window around the match.
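
To make the context-window idea concrete, here is a small sketch that resolves a chunk's neighbors by hand. It does not use the indx API: context_window is a hypothetical helper over plain dicts, and the neighbor texts are placeholders, not real index content.

```python
def context_window(chunks_by_id: dict[str, dict], hit_id: str) -> list[str]:
    """Return the texts of the hit and its neighbors in document order (hypothetical helper)."""
    hit = chunks_by_id[hit_id]
    window = [hit] + [chunks_by_id[n] for n in hit["neighbors"] if n in chunks_by_id]
    # Neighbors share the hit's document, so sorting by index restores reading order.
    window.sort(key=lambda c: c["index"])
    return [c["text"] for c in window]


chunks_by_id = {
    "chunk_0480": {"text": "(previous chunk text)", "index": 0, "neighbors": ["chunk_0481"]},
    "chunk_0481": {
        "text": "Enterprise data is retained for 90 days…",
        "index": 1,
        "neighbors": ["chunk_0480", "chunk_0482"],
    },
    "chunk_0482": {"text": "(next chunk text)", "index": 2, "neighbors": ["chunk_0481"]},
}
window = context_window(chunks_by_id, "chunk_0481")
print(window)
```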
Chunking concerns
How content is split is the part that most affects retrieval quality. The Chunk stage has to balance several concerns:
- Chunk size. Chunks must be small enough to be precise retrieval targets, but large enough to stand alone as meaningful context.
- Overlap. A sliding overlap between adjacent chunks can preserve context across boundaries, at the cost of some redundancy.
- Structure-awareness. Respecting headings, paragraphs, list items, and table boundaries from the ParsedDoc keeps semantically related content together rather than splitting on raw length alone.
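
The sketch below shows how these three concerns can interact in a simple greedy packer. It is not the algorithm indx uses: pack_blocks, its max_chars and overlap parameters, and the (kind, text) block shape are all assumptions made for illustration, and the sample blocks are made up.

```python
Block = tuple[str, str]  # (kind, text); an assumed shape, not the real ParsedDoc block type


def pack_blocks(blocks: list[Block], max_chars: int = 200, overlap: int = 0) -> list[str]:
    """Greedily pack blocks into chunks without ever splitting a block."""
    chunks: list[list[Block]] = []
    current: list[Block] = []
    for block in blocks:
        kind, text = block
        has_body = any(k != "heading" for k, _ in current)
        too_big = sum(len(t) for _, t in current) + len(text) > max_chars
        if has_body and kind == "heading":
            # Structure: a heading opens a new section, so close the chunk with no overlap.
            chunks.append(current)
            current = []
        elif has_body and too_big:
            # Size: budget exceeded, so close the chunk and carry its tail blocks forward.
            chunks.append(current)
            current = current[-overlap:] if overlap else []
        current.append(block)
    if current:
        chunks.append(current)
    return ["\n".join(t for _, t in c) for c in chunks]


sample = [
    ("heading", "Retention"),
    ("paragraph", "Enterprise data is retained for 90 days."),
    ("paragraph", "Backups follow the same schedule."),
    ("heading", "Exceptions"),
    ("paragraph", "Legal holds pause deletion."),
]
result = pack_blocks(sample, max_chars=60, overlap=1)
for chunk_text in result:
    print(repr(chunk_text))
```

A chunk is only closed once it holds some body text, so a heading is never stranded without the first block of its section, even when that pushes a chunk past the size budget.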
Where chunks go next
```
02 Parse ──▶ 03 Chunk ──▶ 04 Relate ──▶ 05 Enrich ──▶ 06 Embed+Pack
             ctx.chunks   relations     metadata      vectors + .indx
```

- 04 Relate reads chunk adjacency to materialize continues edges and resolves references, sibling, parent, and duplicate-of relations.
- 05 Enrich attaches topics, tags, and summary to each chunk’s metadata.
- 06 Embed+Pack vectorizes chunk.text, writes vectors to the store, and seals the .indx archive, with one JSON file per chunk under chunks/.
See also
- Data models reference: the full Chunk, Source, and Relation definitions.
- Reproducibility: deterministic id assignment and byte-stable output.
- 04 Relate: how neighbor links become typed continues relations.