Pipeline Overview

Every indx build is a DirectoryPipeline — an ordered set of six stages that turn a directory (or ZIP) into a KnowledgeSpace and a sealed .indx archive. Each stage receives and returns one shared, mutable SpaceContext, so work accumulates as the run proceeds.

The six stages

The pipeline runs in a fixed canonical order: 01 Walk → 02 Parse → 03 Chunk → 04 Relate → 05 Enrich → 06 Embed+Pack. Three stages are pure built-ins; three drive one or more swappable components.

| # | Stage | Responsibility | Primary component |
|---|-------|----------------|-------------------|
| 01 | Walk | Traverse folder/ZIP, build the directory graph, detect file types | None (built-in) |
| 02 | Parse | Run each file through the chosen parser (text, tables, layout, images) | Parser |
| 03 | Chunk | Split content with structure intact; keep lineage and neighbor links | None (built-in) |
| 04 | Relate | Resolve references, siblings, parents, duplicates into typed Relations | None (built-in) |
| 05 | Enrich | LLM/VLM add detected type, topics, tags, summaries | LLM, VLM |
| 06 | Embed+Pack | Vectorize chunks, write vectors to the store, seal the .indx archive | Embedder, Store, OutputWriter |

Data flow

A single SpaceContext is threaded through the whole run. The command-line entry point constructs the pipeline, binds components, then hands the context to each stage in turn.

                          indx ./docs --out ./ai-ready --config indx.toml
                                            │
                                            ▼
                              ┌─────────────────────────┐
                              │     DirectoryPipeline    │
                              └─────────────────────────┘
                                            │
            ┌───────────────────────────────┴───────────────────────────────┐
            │            SpaceContext  (one object, threaded through)          │
            │   ctx.root · ctx.out · ctx.dir_graph · ctx.parser · ctx.parsed · │
            │   ctx.documents · ctx.chunks · ctx.relations · ctx.embeddings ·  │
            │   ctx.config · ctx.errors                                        │
            └───────────────────────────────┬───────────────────────────────┘
                                            │ run(ctx) -> ctx  (same object)
   01 Walk            02 Parse          03 Chunk          04 Relate        05 Enrich        06 Embed+Pack
 ┌────────┐         ┌────────┐        ┌────────┐        ┌────────┐       ┌────────┐       ┌──────────────┐
 │ folder │  dir    │ Parser │ Parsed │ split  │ Chunk  │ resolve│ Rel-  │ LLM/VLM│ meta  │ Embedder→Store│
 │ /ZIP   │ graph   │ .parse │ Doc    │ +line  │ list   │ refs/  │ ations│ topics │ data  │ + OutputWriter│
 │ walk   │────────▶│        │───────▶│ -age   │───────▶│ dups   │──────▶│ tags   │──────▶│ seal .indx    │
 └────────┘         └────────┘        └────────┘        └────────┘       └────────┘       └──────┬───────┘
   detect types       text/table/                                                                │
                      layout/image                                                                ▼
                                                                                       ┌───────────────────┐
                                                                                       │  KnowledgeSpace    │
                                                                                       │  + ./ai-ready/     │
                                                                                       │    handbook.indx   │
                                                                                       │    index.json      │
                                                                                       │    chunks/         │
                                                                                       │    embeddings/     │
                                                                                       └───────────────────┘

SpaceContext threading

A fresh DirectoryPipeline registers the six built-in stages in canonical order, binds the resolved components onto the context, and then runs each stage. Every stage obeys the same contract, run(ctx: SpaceContext) -> SpaceContext, and returns the same object it received, mutated in place. Because nothing is copied or replaced between stages, each stage sees everything the earlier stages produced: the directory graph from Walk, ParsedDocs from Parse, Chunks from Chunk, Relations from Relate, enriched metadata from Enrich, and finally vectors from Embed+Pack. See the Stage protocol for the exact contract.
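The contract can be illustrated with a small, self-contained sketch. It does not import indx: the SpaceContext dataclass, the Walk and Parse stand-ins, and the run_stages helper below are hypothetical, and only the run(ctx) -> ctx shape and the ctx field names come from this page. See the Stage protocol for the real definition.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class SpaceContext:
    """Illustrative stand-in holding a subset of the ctx.* fields in the diagram."""
    root: str
    out: str
    dir_graph: dict = field(default_factory=dict)
    parsed: dict = field(default_factory=dict)
    errors: list = field(default_factory=list)


class Stage(Protocol):
    """Assumed shape of a stage: a name plus run(ctx) -> ctx."""
    name: str

    def run(self, ctx: SpaceContext) -> SpaceContext: ...


class Walk:
    name = "walk"

    def run(self, ctx: SpaceContext) -> SpaceContext:
        # Hypothetical: a hard-coded listing instead of a real traversal.
        ctx.dir_graph[ctx.root] = ["a.md", "b.md"]
        return ctx  # same object, mutated in place


class Parse:
    name = "parse"

    def run(self, ctx: SpaceContext) -> SpaceContext:
        # Reads what Walk wrote and adds its own output alongside it.
        for files in ctx.dir_graph.values():
            for path in files:
                ctx.parsed[path] = f"text of {path}"
        return ctx


def run_stages(stages, ctx: SpaceContext) -> SpaceContext:
    """Hand the one context to each stage in order and enforce the identity rule."""
    for stage in stages:
        returned = stage.run(ctx)
        if returned is not ctx:
            raise TypeError(f"stage {stage.name!r} must return the context it received")
    return ctx


ctx = SpaceContext(root="./docs", out="./ai-ready")
result = run_stages([Walk(), Parse()], ctx)
```

The identity check in run_stages is a design choice of this sketch, not a documented behavior: it makes a stage that builds and returns a fresh context fail loudly instead of silently discarding earlier work.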

What each stage consumes and produces

| Stage | Reads from ctx | Writes to ctx |
|-------|----------------|---------------|
| 01 Walk | root | dir_graph (folder→children + detected types) |
| 02 Parse | dir_graph | parsed (doc_id → ParsedDoc), documents |
| 03 Chunk | parsed, documents | chunks (with lineage + neighbor links) |
| 04 Relate | documents, chunks | relations, plus per-object references |
| 05 Enrich | documents, chunks | enriched metadata (type, topics, tags, summaries) |
| 06 Embed+Pack | chunks, out | embeddings, the store, and the sealed .indx archive |
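One way to reason about the table above is to treat it as data and check whether a given stage list still has every input it needs. The sketch below is an assumption-laden illustration, not something indx is documented to do: the stage names other than "enrich" and "embed-pack" are guesses, "metadata" is a made-up label for Enrich's output, and missing_inputs and drop are hypothetical helpers.

```python
# (stage name, fields read, fields written), transcribed from the table.
STAGES = [
    ("walk", {"root"}, {"dir_graph"}),
    ("parse", {"dir_graph"}, {"parsed", "documents"}),
    ("chunk", {"parsed", "documents"}, {"chunks"}),
    ("relate", {"documents", "chunks"}, {"relations"}),
    ("enrich", {"documents", "chunks"}, {"metadata"}),
    ("embed-pack", {"chunks", "out"}, {"embeddings"}),
]

# Fields assumed to be set on the context before the first stage runs.
INITIAL = {"root", "out"}


def missing_inputs(stages, initial=INITIAL):
    """Return {stage: [fields it reads that no earlier stage wrote]}."""
    available = set(initial)
    problems = {}
    for name, reads, writes in stages:
        missing = reads - available
        if missing:
            problems[name] = sorted(missing)
        available |= writes
    return problems


def drop(stages, name):
    """Return a copy of the stage list without the named stage."""
    return [stage for stage in stages if stage[0] != name]


full_run = missing_inputs(STAGES)
without_enrich = missing_inputs(drop(STAGES, "enrich"))
without_chunk = missing_inputs(drop(STAGES, "chunk"))
```

Under these assumptions the full list and the list without "enrich" both report no missing inputs, because no stage in the table reads what Enrich writes. Removing "chunk" leaves Relate, Enrich, and Embed+Pack each missing chunks.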

Stages are replaceable and droppable

The pipeline is a list you control, not a black box. You can swap the component a stage drives, insert a custom stage at any index, replace a stage by name, or drop a stage entirely.

drop("enrich") — skip all LLM/VLM work (no model available, or you don’t want enrichment). This is a common, supported operation.
drop("embed-pack") (or the --no-embed flag) — produce a graph-only space with no vectors.
Custom stages — anything that satisfies the Stage protocol and returns the same context can be inserted, e.g. a PII-redaction pass after Chunk.

from indx import DirectoryPipeline

# requires: pip install "indx[local]"  (docling + bge-m3 + qdrant)
pipeline = (
    DirectoryPipeline(embedder="bge-m3", store="qdrant")
    .use(parser="docling")   # swap a component
    .drop("enrich")          # skip LLM enrichment entirely
)
space = pipeline.run("./notes", "./out")
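The "pipeline is a list" idea, including a PII-redaction pass inserted after Chunk, can be sketched without the library. Everything below is a hypothetical stand-in: MiniPipeline, its insert_after method, the Ctx dataclass, and the stage classes are invented for illustration and are not the indx API. Only drop("enrich") and the run(ctx) -> ctx contract are taken from this page. The email regex is deliberately simple and is not a complete PII detector.

```python
import re
from dataclasses import dataclass, field

# Simplistic email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


@dataclass
class Ctx:
    """Stand-in for SpaceContext with just the fields this sketch needs."""
    chunks: list = field(default_factory=list)
    tags: list = field(default_factory=list)


class Chunk:
    name = "chunk"

    def run(self, ctx):
        # Hypothetical: fixed chunks instead of real splitting.
        ctx.chunks.extend(["Contact alice@example.com for access.", "No PII here."])
        return ctx


class Enrich:
    name = "enrich"

    def run(self, ctx):
        ctx.tags.append("handbook")  # placeholder for LLM/VLM output
        return ctx


class RedactPII:
    """Custom stage: rewrites chunk text in place and returns the same context."""
    name = "redact-pii"

    def run(self, ctx):
        ctx.chunks[:] = [EMAIL.sub("[redacted]", text) for text in ctx.chunks]
        return ctx


class MiniPipeline:
    """A plain ordered list of stages with name-based editing."""

    def __init__(self, stages):
        self.stages = list(stages)

    def _index(self, name):
        for i, stage in enumerate(self.stages):
            if stage.name == name:
                return i
        raise KeyError(name)

    def insert_after(self, name, stage):
        self.stages.insert(self._index(name) + 1, stage)
        return self

    def drop(self, name):
        del self.stages[self._index(name)]
        return self

    def run(self, ctx):
        for stage in self.stages:
            ctx = stage.run(ctx)
        return ctx


pipeline = (
    MiniPipeline([Chunk(), Enrich()])
    .insert_after("chunk", RedactPII())
    .drop("enrich")
)
ctx = Ctx()
result = pipeline.run(ctx)
```

Placing the redaction pass directly after the chunking stage means every later stage in this sketch only ever sees redacted text. For the real insertion and replacement calls, see Custom Stage.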

Explore each stage

01 Walk: Traverse the folder/ZIP, build the directory graph, detect file types.

02 Parse: Run each file through the chosen Parser into a ParsedDoc.

03 Chunk: Split content with structure intact; keep lineage and neighbor links.

04 Relate: Resolve references, siblings, parents, and duplicates into typed Relations.

05 Enrich: LLM/VLM add detected type, topics, tags, and summaries.

06 Embed+Pack: Vectorize chunks, write to the store, and seal the .indx archive.

Next steps

Read Pipeline & Stages for the conceptual model behind stages and the shared context.
See the protocols reference for the Stage, Parser, LLM, VLM, Embedder, Store, and OutputWriter contracts.
Learn to write your own with Custom Stage and Custom Components.