
The Pipeline & Stages

DirectoryPipeline is the engine that turns a directory into a KnowledgeSpace. It is not a black box: it is an ordered list of stages you can read, reorder, extend, and trim. This page explains the architecture so the CLI, the SDK, and the output format all make sense as views onto the same idea.

A pipeline is an ordered list of stages. Each stage receives and returns a single shared, mutable object, the SpaceContext, and every stage obeys the same contract:

def run(self, ctx: SpaceContext) -> SpaceContext: ...

The crucial detail: a stage returns the same SpaceContext instance it received, mutated in place. Stages never communicate through globals or side channels — only through the context. This gives every stage a uniform, append-only view of the work done so far, and it is what makes stages freely insertable and removable.
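To make the contract concrete, here is a minimal sketch of a custom stage and a loop that threads one context through a list of stages. The SpaceContext below is a two-field stand-in, not the real class, and LowercaseStage, the name attribute, and run_all are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SpaceContext:
    """Stand-in for the real SpaceContext; only two collections are modeled."""
    chunks: list[str] = field(default_factory=list)
    errors: list[dict] = field(default_factory=list)


class LowercaseStage:
    """Hypothetical custom stage that normalizes chunk text."""
    name = "lowercase"  # assumed attribute; the docs only say stages have stable names

    def run(self, ctx: SpaceContext) -> SpaceContext:
        # Mutate the shared list in place instead of rebinding a new context.
        ctx.chunks[:] = [chunk.lower() for chunk in ctx.chunks]
        return ctx  # the same instance that was passed in


def run_all(stages, ctx: SpaceContext) -> SpaceContext:
    """Pass one context through every stage, in list order."""
    for stage in stages:
        ctx = stage.run(ctx)
    return ctx


ctx = SpaceContext(chunks=["Hello", "WORLD"])
result = run_all([LowercaseStage()], ctx)
```

Because run hands back the instance it was given, result and ctx are the same object here, so a caller can keep either reference.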

A fresh DirectoryPipeline registers six built-in stages in canonical order. Each stage drives a particular component slot (or is fully built-in):

| # | Stage (name) | One-line responsibility | Component slot |
|---|---|---|---|
| 01 | walk | Traverse the folder/ZIP, build the directory graph, detect file types | none (built-in) |
| 02 | parse | Run each file through the chosen parser → text, tables, layout, images | Parser |
| 03 | chunk | Split content with structure intact; keep lineage and neighbor links | none (built-in) |
| 04 | relate | Resolve references, siblings, parents, duplicates → typed Relations | none (built-in) |
| 05 | enrich | LLM/VLM add detected type, topics, tags, and summaries | LLM, VLM |
| 06 | embed-pack | Vectorize chunks, write to the store, seal the .indx archive | Embedder, Store, OutputWriter |

For a deep dive into the inputs, outputs, and tuning of each stage, see the pipeline reference and the per-stage pages: Walk, Parse, Chunk, Relate, Enrich, and Embed+Pack.

The SpaceContext carries inputs (root, out, config), the resolved components, and a set of collections that fill up phase by phase. Each stage appends to the collection relevant to its work; later stages read what earlier stages produced:

01 Walk → ctx.dir_graph (folder → children + detected types)
02 Parse → ctx.parsed (doc_id → ParsedDoc)
03 Chunk → ctx.chunks (retrievable units, with neighbors)
04 Relate → ctx.relations (typed edges between docs/chunks)
05 Enrich → enriches ctx.documents (adds type, topics, tags, summary)
06 Embed+Pack → ctx.embeddings (chunk_id → vector) + sealed .indx

The document graph (ctx.documents) is built up across stages 01–05; vectors land last. When the pipeline finishes, the context is materialized into a KnowledgeSpace:

ctx.dir_graph → parsed → documents → chunks → relations → embeddings ⇒ KnowledgeSpace

That arrow diagram shows how the collections nest inside the final KnowledgeSpace — not the temporal order in which stages run. (The stages execute in the canonical 01–06 order above; ctx.documents is enriched in place across 01–05 rather than created at any single step.)

Per-item failures collected along the way are recorded on ctx.errors and surfaced on the result under space.metadata["errors"].
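As a rough picture of the data flow described above, the sketch below models the context as a dataclass with one field per collection and collapses it into a plain dict standing in for a KnowledgeSpace. The field types, the materialize helper, and the shape of the error record are assumptions for illustration; the resolved components are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SpaceContext:
    """Illustrative stand-in; every field type here is an assumption."""
    # Inputs
    root: str
    out: str
    config: dict[str, Any] = field(default_factory=dict)
    # Collections, filled stage by stage
    dir_graph: dict[str, list[str]] = field(default_factory=dict)        # 01 walk
    parsed: dict[str, str] = field(default_factory=dict)                 # 02 parse
    documents: dict[str, dict[str, Any]] = field(default_factory=dict)   # built up 01-05
    chunks: list[dict[str, Any]] = field(default_factory=list)           # 03 chunk
    relations: list[tuple[str, str, str]] = field(default_factory=list)  # 04 relate
    embeddings: dict[str, list[float]] = field(default_factory=dict)     # 06 embed-pack
    errors: list[dict[str, str]] = field(default_factory=list)


def materialize(ctx: SpaceContext) -> dict[str, Any]:
    """Hypothetical helper: nest the collections and surface errors in metadata."""
    return {
        "dir_graph": ctx.dir_graph,
        "parsed": ctx.parsed,
        "documents": ctx.documents,
        "chunks": ctx.chunks,
        "relations": ctx.relations,
        "embeddings": ctx.embeddings,
        "metadata": {"errors": list(ctx.errors)},
    }


ctx = SpaceContext(root="./notes", out="./out")
ctx.dir_graph["notes"] = ["a.md", "b.pdf"]
ctx.parsed["a.md"] = "alpha"
ctx.documents["a.md"] = {"type": "markdown"}
ctx.errors.append({"kind": "skip", "stage": "parse", "item": "b.pdf"})
space = materialize(ctx)
```

In this toy run the embed stage never executes, so the embeddings collection stays empty while the skipped file still shows up under the metadata errors.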

Every stage has a stable name (walk, parse, chunk, relate, enrich, embed-pack). Because stages are just an ordered list keyed by name, you can reshape the pipeline with a small, conceptual API on DirectoryPipeline:

| Method | What it does |
|---|---|
| stages() | Return the current ordered stage list. |
| insert(index, stage) | Insert a custom stage at a 0-based position. |
| append(stage) | Add a stage to the end. |
| replace(name, stage) | Swap out the stage with the given name. |
| drop(name) | Remove the named stage entirely. |

The mutating methods (insert, append, replace, and drop) return the pipeline for chaining; stages() returns the stage list. (Components, not stages, are swapped separately with use(parser=..., llm=..., store=...); see Bring your own stack.)

from indx import DirectoryPipeline

# PiiRedactStage and MyChunker are user-defined; see /guides/custom-stage/.
pipeline = (
    DirectoryPipeline(embedder="bge-m3", store="chroma")
    .drop("enrich")  # skip all LLM work
    # With Enrich already dropped, index 3 places the stage after Chunk
    # (index 2) and before Relate, which shifts from 3 to 4.
    .insert(3, PiiRedactStage())
)
space = pipeline.run("./notes", "./out")

Common ways to reshape the pipeline:

  • drop("enrich"): skip LLM/VLM work entirely. Useful when no model is available, or when you only need the structural graph (folders, neighbors, relations) without topics, tags, or summaries. This is a fully supported, common operation.
  • drop("embed-pack"): produce a graph-only space with no vectors. Handy when a downstream store self-embeds, or when you want to inspect structure before committing to an embedder. (The CLI exposes the same intent as --no-embed.)
  • insert(i, stage): add a custom pass such as redaction before Enrich, so sensitive content is stripped before any egress-capable component sees it, or a deduplication step before Relate.
  • replace("chunk", MyChunker()): substitute a built-in stage with your own implementation.
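
Index arithmetic is the easiest part of this API to get wrong, so the following toy model tracks only stage names to show how positions shift. StageList is a hypothetical class written for this illustration, not the DirectoryPipeline implementation, and it stores plain strings where the real pipeline holds stage objects.

```python
class StageList:
    """Toy model of an ordered stage list keyed by name."""

    def __init__(self, names):
        self._names = list(names)

    def stages(self):
        return list(self._names)

    def insert(self, index, name):
        self._names.insert(index, name)  # 0-based, like list.insert
        return self

    def append(self, name):
        self._names.append(name)
        return self

    def replace(self, old, new):
        self._names[self._names.index(old)] = new
        return self

    def drop(self, name):
        self._names.remove(name)  # raises ValueError if the name is absent
        return self


BUILT_IN = ["walk", "parse", "chunk", "relate", "enrich", "embed-pack"]

# Drop first, then insert: index 3 is now the slot Relate occupied.
after_drop = StageList(BUILT_IN).drop("enrich").insert(3, "pii-redact").stages()

# Keep Enrich and redact immediately before it by looking up its position.
before_enrich = (
    StageList(BUILT_IN)
    .insert(BUILT_IN.index("enrich"), "pii-redact")
    .stages()
)
```

Here after_drop comes out as walk, parse, chunk, pii-redact, relate, embed-pack, and before_enrich places pii-redact between relate and enrich. Looking up the position by name rather than hard-coding an index keeps the insert correct if earlier stages are added or removed.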

Not every failure should stop a build. indx distinguishes two kinds:

  • Per-item (skip) — a single file fails to parse, or one document’s LLM call times out. The item is skipped, a skip-kind error is appended to ctx.errors, and the pipeline continues. This is the default behaviour for Parse and Enrich.
  • Fatal — misconfiguration, an unreachable store, an unresolvable component name, or a stage raising an unhandled exception. The pipeline aborts and re-raises a wrapping error.

The --strict CLI flag (and strict=True in the SDK) promotes every skip into a fatal error — so any single failure aborts the run. Either way, errors are visible: nothing is silently swallowed, and ctx.errors ends up on space.metadata["errors"] for inspection.
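
The skip-versus-fatal distinction can be sketched as a loop that records per-item failures and, in strict mode, raises instead. PipelineError, parse_all, fake_parse, and the fields of the error record are all hypothetical; the source only states that skips are appended to ctx.errors and that strict mode turns them into fatal errors.

```python
class PipelineError(RuntimeError):
    """Hypothetical wrapping error raised when a run aborts."""


def parse_all(files, parse, errors, strict=False):
    """Parse each file; record failures as skips unless strict is set."""
    parsed = {}
    for path in files:
        try:
            parsed[path] = parse(path)
        except Exception as exc:
            if strict:
                # Strict mode: the first per-item failure aborts the run.
                raise PipelineError(f"parse failed for {path}") from exc
            errors.append(
                {"kind": "skip", "stage": "parse", "item": path, "message": str(exc)}
            )
    return parsed


def fake_parse(path):
    """Stand-in parser that rejects one file type."""
    if path.endswith(".bin"):
        raise ValueError("unsupported file type")
    return f"text of {path}"


errors = []
parsed = parse_all(["a.md", "b.bin", "c.md"], fake_parse, errors)

strict_aborted = False
try:
    parse_all(["a.md", "b.bin"], fake_parse, [], strict=True)
except PipelineError:
    strict_aborted = True
```

In the default run the two Markdown files are parsed and the one bad file is recorded as a skip; the strict run stops at the same file with a wrapped exception whose cause is the original ValueError.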

Because the pipeline is an ordered list and every stage shares one typed contract, the same code runs from a laptop to an air-gapped server: drop the cloud stages or swap in local components without rewriting orchestration. Stages stay replaceable and optional, components stay swappable by name, and the output stays deterministic and portable.