Pipeline Overview
Every indx build is a DirectoryPipeline — an ordered set of six stages that turn a directory (or ZIP) into a KnowledgeSpace and a sealed .indx archive. Each stage receives and returns one shared, mutable SpaceContext, so work accumulates as the run proceeds.
The six stages
The pipeline runs in a fixed canonical order: 01 Walk → 02 Parse → 03 Chunk → 04 Relate → 05 Enrich → 06 Embed+Pack. Three stages are pure built-ins; three drive one or more swappable components.
| # | Stage | Responsibility | Primary component |
|---|-------|----------------|-------------------|
| 01 | Walk | Traverse folder/ZIP, build the directory graph, detect file types | — (built-in) |
| 02 | Parse | Run each file through the chosen parser (text, tables, layout, images) | Parser |
| 03 | Chunk | Split content with structure intact; keep lineage and neighbor links | — (built-in) |
| 04 | Relate | Resolve references, siblings, parents, duplicates into typed Relations | — (built-in) |
| 05 | Enrich | LLM/VLM add detected type, topics, tags, summaries | LLM, VLM |
| 06 | Embed+Pack | Vectorize chunks, write vectors to the store, seal the .indx archive | Embedder, Store, OutputWriter |
Data flow
A single SpaceContext is threaded through the whole run. The command-line entry point constructs the pipeline, binds components, then hands the context to each stage in turn.
```text
indx ./docs --out ./ai-ready --config indx.toml
                                │
                                ▼
                   ┌─────────────────────────┐
                   │    DirectoryPipeline    │
                   └────────────┬────────────┘
                                │
┌───────────────────────────────┴────────────────────────────────┐
│ SpaceContext (one object, threaded through)                    │
│ ctx.root · ctx.out · ctx.dir_graph · ctx.parser · ctx.parsed · │
│ ctx.documents · ctx.chunks · ctx.relations · ctx.embeddings ·  │
│ ctx.config · ctx.errors                                        │
└───────────────────────────────┬────────────────────────────────┘
                                │  run(ctx) -> ctx (same object)

01 Walk           02 Parse          03 Chunk          04 Relate         05 Enrich         06 Embed+Pack
┌────────┐        ┌────────┐        ┌────────┐        ┌────────┐        ┌────────┐        ┌────────────────┐
│ folder │  dir   │ Parser │ Parsed │ split  │ Chunk  │ resolve│  Rel-  │ LLM/VLM│  meta  │ Embedder→Store │
│ /ZIP   │ graph  │ .parse │  Doc   │ +line  │  list  │ refs/  │ ations │ topics │  data  │ + OutputWriter │
│ walk   │───────▶│        │───────▶│ -age   │───────▶│ dups   │───────▶│ tags   │───────▶│ seal .indx     │
└────────┘        └────────┘        └────────┘        └────────┘        └────────┘        └───────┬────────┘
detect types      text/table/                                                                     │
                  layout/image                                                                    ▼
                                                                                        ┌───────────────────┐
                                                                                        │ KnowledgeSpace    │
                                                                                        │ + ./ai-ready/     │
                                                                                        │     handbook.indx │
                                                                                        │     index.json    │
                                                                                        │     chunks/       │
                                                                                        │     embeddings/   │
                                                                                        └───────────────────┘
```
SpaceContext threading
A fresh DirectoryPipeline registers the six built-in stages in canonical order, binds the resolved components onto the context, and then runs each stage. Every stage obeys the same contract, `run(ctx: SpaceContext) -> SpaceContext`, and returns the same object it received, mutated in place. That gives stages a uniform, append-only view of accumulated work: the directory graph from Walk, ParsedDocs from Parse, Chunks from Chunk, Relations from Relate, enriched metadata from Enrich, and finally vectors from Embed+Pack. See the Stage protocol for the exact contract.
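To make the contract concrete, here is a minimal, self-contained sketch of context threading. It does not import indx: the `SpaceContext` dataclass below is a hypothetical stand-in carrying only a few of the fields named above, the three toy stages fake their work, and `run_stages` is an illustrative helper rather than the library's runner. The only behavior taken from the text is that each stage's `run(ctx)` returns the same object it received.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class SpaceContext:
    """Hypothetical stand-in with a few of the ctx fields named above."""
    root: str
    dir_graph: dict = field(default_factory=dict)
    parsed: dict = field(default_factory=dict)
    chunks: list = field(default_factory=list)


class Stage(Protocol):
    """Assumed shape of the contract: a name plus run(ctx) -> ctx."""
    name: str

    def run(self, ctx: SpaceContext) -> SpaceContext: ...


class Walk:
    name = "walk"

    def run(self, ctx: SpaceContext) -> SpaceContext:
        # Fake traversal: pretend the root folder holds two files.
        ctx.dir_graph[ctx.root] = ["a.md", "b.md"]
        return ctx


class Parse:
    name = "parse"

    def run(self, ctx: SpaceContext) -> SpaceContext:
        # Reads what Walk wrote, adds one parsed entry per file.
        for path in ctx.dir_graph[ctx.root]:
            ctx.parsed[path] = f"text of {path}"
        return ctx


class Chunk:
    name = "chunk"

    def run(self, ctx: SpaceContext) -> SpaceContext:
        # Reads what Parse wrote, appends one chunk per document.
        for doc_id, text in ctx.parsed.items():
            ctx.chunks.append({"doc_id": doc_id, "text": text})
        return ctx


def run_stages(stages: list[Stage], ctx: SpaceContext) -> SpaceContext:
    """Run stages in order and enforce the same-object rule."""
    for stage in stages:
        returned = stage.run(ctx)
        if returned is not ctx:
            raise TypeError(f"stage {stage.name!r} must return the context it received")
    return ctx


ctx = SpaceContext(root="./docs")
result = run_stages([Walk(), Parse(), Chunk()], ctx)
```

After the run, `result is ctx` holds and the context carries the output of all three stages. The identity check in `run_stages` is one way a runner could catch a stage that accidentally builds a new context.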
What each stage consumes and produces
| Stage | Reads from ctx | Writes to ctx |
|-------|------------------|-----------------|
| 01 Walk | root | dir_graph (folder→children + detected types) |
| 02 Parse | dir_graph | parsed (doc_id → ParsedDoc), documents |
| 03 Chunk | parsed, documents | chunks (with lineage + neighbor links) |
| 04 Relate | documents, chunks | relations, plus per-object references |
| 05 Enrich | documents, chunks | enriched metadata (type, topics, tags, summaries) |
| 06 Embed+Pack | chunks, out | embeddings, the store, and the sealed .indx archive |
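The table can be read as a dependency chain: every field a stage reads must already be on the context when that stage runs. The sketch below transcribes the table into plain data and checks that property. It is an illustration only, not part of indx. The stage names other than `enrich` and `embed-pack` are assumed, `root` and `out` are assumed to be set before the first stage, and Enrich is assumed to annotate existing objects rather than add a new context field.

```python
# (stage name, fields read, fields written), transcribed from the table above.
# Names other than "enrich" and "embed-pack" are assumptions.
STAGE_IO = [
    ("walk", {"root"}, {"dir_graph"}),
    ("parse", {"dir_graph"}, {"parsed", "documents"}),
    ("chunk", {"parsed", "documents"}, {"chunks"}),
    ("relate", {"documents", "chunks"}, {"relations"}),
    ("enrich", {"documents", "chunks"}, set()),  # assumed: metadata lands on existing objects
    ("embed-pack", {"chunks", "out"}, {"embeddings"}),
]


def missing_inputs(stage_io, initial=frozenset({"root", "out"})):
    """Return {stage: fields it reads that no earlier stage has written}."""
    available = set(initial)
    problems = {}
    for name, reads, writes in stage_io:
        missing = reads - available
        if missing:
            problems[name] = missing
        available |= writes
    return problems


full_run = missing_inputs(STAGE_IO)
without_enrich = missing_inputs([row for row in STAGE_IO if row[0] != "enrich"])
without_chunk = missing_inputs([row for row in STAGE_IO if row[0] != "chunk"])
```

Under these assumptions the canonical order has no missing inputs, and neither does a run without Enrich, because no later stage reads a field that only Enrich writes. Removing Chunk, by contrast, leaves Relate, Enrich, and Embed+Pack without `chunks`.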
Stages are replaceable and droppable
The pipeline is a list you control, not a black box. You can swap the component a stage drives, insert a custom stage at any index, replace a stage by name, or drop a stage entirely.
- `drop("enrich")`: skip all LLM/VLM work (no model available, or you don't want enrichment). This is a common, supported operation.
- `drop("embed-pack")` (or the `--no-embed` flag): produce a graph-only space with no vectors.
- Custom stages: anything that satisfies the `Stage` protocol and returns the same context can be inserted, e.g. a PII-redaction pass after Chunk.
```python
from indx import DirectoryPipeline

# requires: pip install "indx[local]" (docling + bge-m3 + qdrant)
pipeline = (
    DirectoryPipeline(embedder="bge-m3", store="qdrant")
    .use(parser="docling")  # swap a component
    .drop("enrich")         # skip LLM enrichment entirely
)
space = pipeline.run("./notes", "./out")
```
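As a rough illustration of the custom-stage idea, the sketch below implements a redaction pass and places it after Chunk in a plain Python list of stages. It deliberately avoids the indx API. `RedactEmails`, `NamedStage`, `insert_after`, and `drop` are hypothetical names invented for this example, chunks are assumed to be dicts with a `"text"` key, and the stage names other than `enrich` and `embed-pack` are assumed. Masking e-mail addresses stands in for real PII detection, which needs far more than one regular expression.

```python
import re
from types import SimpleNamespace

# Deliberately simple pattern; a stand-in for real PII detection.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class RedactEmails:
    """Hypothetical custom stage: masks e-mail addresses in chunk text."""
    name = "redact-emails"

    def run(self, ctx):
        for chunk in ctx.chunks:
            chunk["text"] = EMAIL.sub("[redacted]", chunk["text"])
        return ctx  # same object, mutated in place


class NamedStage:
    """Placeholder for a built-in stage; does no work."""

    def __init__(self, name):
        self.name = name

    def run(self, ctx):
        return ctx


def insert_after(stages, anchor, new_stage):
    """Return a new list with new_stage placed right after the anchor stage."""
    idx = next(i for i, s in enumerate(stages) if s.name == anchor)
    return stages[: idx + 1] + [new_stage] + stages[idx + 1 :]


def drop(stages, name):
    """Return a new list without the named stage."""
    return [s for s in stages if s.name != name]


stages = [
    NamedStage(n)
    for n in ("walk", "parse", "chunk", "relate", "enrich", "embed-pack")
]
stages = drop(insert_after(stages, "chunk", RedactEmails()), "enrich")

# A SimpleNamespace stands in for the shared context.
ctx = SimpleNamespace(chunks=[{"text": "Contact ada@example.com for access."}])
for stage in stages:
    ctx = stage.run(ctx)
```

Placing the redaction pass directly after Chunk means every later stage in this toy list, including the one standing in for Embed+Pack, only ever sees the masked text.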
Next steps
- Read Pipeline & Stages for the conceptual model behind stages and the shared context.
- See the protocols reference for the `Stage`, `Parser`, `LLM`, `VLM`, `Embedder`, `Store`, and `OutputWriter` contracts.
- Learn to write your own with Custom Stage and Custom Components.