
Pipeline Overview

Every indx build is a DirectoryPipeline — an ordered set of six stages that turn a directory (or ZIP) into a KnowledgeSpace and a sealed .indx archive. Each stage receives and returns one shared, mutable SpaceContext, so work accumulates as the run proceeds.

The pipeline runs in a fixed canonical order: 01 Walk → 02 Parse → 03 Chunk → 04 Relate → 05 Enrich → 06 Embed+Pack. Three stages are pure built-ins; three drive one or more swappable components.

| # | Stage | Responsibility | Primary component |
|---|-------|----------------|-------------------|
| 01 | Walk | Traverse folder/ZIP, build the directory graph, detect file types | none (built-in) |
| 02 | Parse | Run each file through the chosen parser (text, tables, layout, images) | Parser |
| 03 | Chunk | Split content with structure intact; keep lineage and neighbor links | none (built-in) |
| 04 | Relate | Resolve references, siblings, parents, duplicates into typed Relations | none (built-in) |
| 05 | Enrich | LLM/VLM add detected type, topics, tags, summaries | LLM, VLM |
| 06 | Embed+Pack | Vectorize chunks, write vectors to the store, seal the .indx archive | Embedder, Store, OutputWriter |

A single SpaceContext is threaded through the whole run. The command-line entry point constructs the pipeline, binds components, then hands the context to each stage in turn.

```text
indx ./docs --out ./ai-ready --config indx.toml

DirectoryPipeline
  SpaceContext (one object, threaded through)
    ctx.root · ctx.out · ctx.dir_graph · ctx.parser · ctx.parsed ·
    ctx.documents · ctx.chunks · ctx.relations · ctx.embeddings ·
    ctx.config · ctx.errors

  run(ctx) -> ctx (same object)

  01 Walk        folder/ZIP walk, detect types               ──▶ dir graph
  02 Parse       Parser.parse (text/table/layout/image)      ──▶ ParsedDoc
  03 Chunk       split + lineage                             ──▶ Chunk list
  04 Relate      resolve refs/dups                           ──▶ Relations
  05 Enrich      LLM/VLM topics, tags                        ──▶ metadata
  06 Embed+Pack  Embedder→Store + OutputWriter, seal .indx   ──▶ KnowledgeSpace

  KnowledgeSpace + ./ai-ready/
    handbook.indx
    index.json
    chunks/
    embeddings/
```

A fresh DirectoryPipeline registers the six built-in stages in canonical order, binds the resolved components onto the context, and then runs each stage. Every stage obeys the same contract — run(ctx: SpaceContext) -> SpaceContext — and returns the same object it received, mutated in place. That gives stages a uniform, append-only view of accumulated work: the directory graph from Walk, ParsedDocs from Parse, Chunks from Chunk, Relations from Relate, enriched metadata from Enrich, and finally vectors from Embed+Pack. See the Stage protocol for the exact contract.
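The contract described above can be sketched in plain Python. This is a minimal illustration, not indx's own code: the `SpaceContext` here is a cut-down stand-in with only a few of the fields named in this page, and `Walk`, `Parse`, and `run_stages` are hypothetical toy versions written for the example.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class SpaceContext:
    """Stand-in for the shared context; only a few illustrative fields."""
    root: str
    out: str
    dir_graph: dict[str, list[str]] = field(default_factory=dict)
    parsed: dict[str, str] = field(default_factory=dict)
    errors: list[str] = field(default_factory=list)


class Stage(Protocol):
    """Anything with a name and a run(ctx) -> ctx method counts as a stage."""
    name: str

    def run(self, ctx: SpaceContext) -> SpaceContext: ...


class Walk:
    name = "walk"

    def run(self, ctx: SpaceContext) -> SpaceContext:
        # A real Walk stage would traverse ctx.root; this toy records a fixed listing.
        ctx.dir_graph[ctx.root] = ["a.md", "b.md"]
        return ctx


class Parse:
    name = "parse"

    def run(self, ctx: SpaceContext) -> SpaceContext:
        # Reads what Walk wrote and adds its own results to the same object.
        for children in ctx.dir_graph.values():
            for path in children:
                ctx.parsed[path] = f"text of {path}"
        return ctx


def run_stages(stages: list[Stage], ctx: SpaceContext) -> SpaceContext:
    for stage in stages:
        result = stage.run(ctx)
        # Enforce the contract: a stage must hand back the object it was given.
        if result is not ctx:
            raise TypeError(f"stage {stage.name!r} returned a different context")
    return ctx


ctx = SpaceContext(root="./docs", out="./ai-ready")
final = run_stages([Walk(), Parse()], ctx)
```

Because every stage returns the object it received, `final is ctx` holds after the run, and later stages see everything earlier stages wrote.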

| Stage | Reads from ctx | Writes to ctx |
|-------|------------------|-----------------|
| 01 Walk | root | dir_graph (folder→children + detected types) |
| 02 Parse | dir_graph | parsed (doc_id → ParsedDoc), documents |
| 03 Chunk | parsed, documents | chunks (with lineage + neighbor links) |
| 04 Relate | documents, chunks | relations, plus per-object references |
| 05 Enrich | documents, chunks | enriched metadata (type, topics, tags, summaries) |
| 06 Embed+Pack | chunks, out | embeddings, the store, and the sealed .indx archive |
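One way to reason about the table above is to treat it as a dependency map and check whether a given stage order leaves any stage reading a field that nothing earlier has written. The sketch below is an illustration only: the stage names `walk`, `parse`, `chunk`, and `relate` are assumed by analogy with the documented `enrich` and `embed-pack`, the labels `metadata`, `store`, and `archive` are shorthand chosen here rather than documented `ctx` attributes, and `missing_inputs` is a hypothetical helper, not part of indx.

```python
# Reads/writes copied from the table; "metadata", "store" and "archive" are
# shorthand labels for this example, not documented ctx attribute names.
STAGE_IO = {
    "walk":       ({"root"}, {"dir_graph"}),
    "parse":      ({"dir_graph"}, {"parsed", "documents"}),
    "chunk":      ({"parsed", "documents"}, {"chunks"}),
    "relate":     ({"documents", "chunks"}, {"relations"}),
    "enrich":     ({"documents", "chunks"}, {"metadata"}),
    "embed-pack": ({"chunks", "out"}, {"embeddings", "store", "archive"}),
}

INITIAL = {"root", "out"}  # assumed to be set before any stage runs


def missing_inputs(order: list[str]) -> dict[str, set[str]]:
    """Return, per stage, the fields it reads that nothing earlier has written."""
    available = set(INITIAL)
    problems: dict[str, set[str]] = {}
    for name in order:
        reads, writes = STAGE_IO[name]
        unmet = reads - available
        if unmet:
            problems[name] = unmet
        available |= writes
    return problems


canonical = list(STAGE_IO)  # dicts preserve insertion order
no_enrich = [s for s in canonical if s != "enrich"]
no_parse = [s for s in canonical if s != "parse"]

print(missing_inputs(canonical))
print(missing_inputs(no_enrich))
print(missing_inputs(no_parse))
```

Under these assumptions the canonical order and the order without `enrich` both report no missing inputs, since no other stage reads the enriched metadata, while removing `parse` leaves Chunk, Relate, and Enrich without the fields they read.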

The pipeline is a list you control, not a black box. You can swap the component a stage drives, insert a custom stage at any index, replace a stage by name, or drop a stage entirely.

  • drop("enrich") — skip all LLM/VLM work (no model available, or you don’t want enrichment). This is a common, supported operation.
  • drop("embed-pack") (or the --no-embed flag) — produce a graph-only space with no vectors.
  • Custom stages — anything that satisfies the Stage protocol and returns the same context can be inserted, e.g. a PII-redaction pass after Chunk.

```python
from indx import DirectoryPipeline

# requires: pip install "indx[local]" (docling + bge-m3 + qdrant)
pipeline = (
    DirectoryPipeline(embedder="bge-m3", store="qdrant")
    .use(parser="docling")  # swap a component
    .drop("enrich")         # skip LLM enrichment entirely
)
space = pipeline.run("./notes", "./out")
```
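The custom-stage idea mentioned in the list above (a PII-redaction pass after Chunk) might look roughly like the following. The insertion call on `DirectoryPipeline` is not shown on this page, so the sketch models the pipeline as a plain Python list instead; `RedactEmails`, the toy `Chunk` and `Enrich` classes, and `index_of` are all hypothetical names for illustration, and the e-mail pattern is deliberately simple rather than a complete PII detector.

```python
import re
from types import SimpleNamespace

# Deliberately simple pattern; real PII detection needs far more than this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class RedactEmails:
    """Hypothetical custom stage: masks e-mail addresses in chunk text."""
    name = "redact-emails"

    def run(self, ctx):
        # Mutate the existing list in place and return the same context object.
        ctx.chunks[:] = [EMAIL.sub("[email]", text) for text in ctx.chunks]
        return ctx


class Chunk:
    """Toy stand-in for the built-in Chunk stage."""
    name = "chunk"

    def run(self, ctx):
        ctx.chunks.extend(["Contact jane.doe@example.com for access.", "No PII here."])
        return ctx


class Enrich:
    """Toy stand-in for the Enrich stage."""
    name = "enrich"

    def run(self, ctx):
        ctx.enriched = True
        return ctx


def index_of(stages, name):
    """Position of the stage with the given name."""
    return next(i for i, s in enumerate(stages) if s.name == name)


stages = [Chunk(), Enrich()]

# Insert the custom stage right after Chunk, then drop Enrich by name.
stages.insert(index_of(stages, "chunk") + 1, RedactEmails())
stages = [s for s in stages if s.name != "enrich"]

ctx = SimpleNamespace(chunks=[], enriched=False)
for stage in stages:
    ctx = stage.run(ctx)

print([s.name for s in stages])
print(ctx.chunks)
```

Placing the redaction stage directly after Chunk means every later stage, including embedding, only ever sees the masked text.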