Component Protocols
indx is built on structural typing: every replaceable part of the pipeline is a Python typing.Protocol with a default implementation, and any object that satisfies the protocol’s shape can be dropped straight in. This page is the authoritative contract for the seven protocols — Stage, Parser, LLM, VLM, Embedder, Store, and OutputWriter — plus the rules for resolving a component by name.
There is no base class to inherit from. To provide your own component you only need to match the method signatures shown below. See Bring your own components for worked examples and Authoring a plugin for packaging one for distribution.
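As a minimal illustration of what "matching the shape" means, the sketch below defines a toy protocol and a class that satisfies it without inheriting from it. `Greeter`, `EnglishGreeter`, and `NotAGreeter` are hypothetical names used only for this example; they are not part of indx.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Greeter(Protocol):
    """Hypothetical protocol, standing in for any indx component protocol."""

    def greet(self, name: str) -> str: ...


class EnglishGreeter:
    """Matches Greeter's shape; note that there is no base class."""

    def greet(self, name: str) -> str:
        return f"Hello, {name}!"


class NotAGreeter:
    """Lacks a greet() method, so it does not satisfy the protocol."""


# runtime_checkable enables isinstance(), which only checks that the
# members exist; it does not validate signatures or return types.
conforms = isinstance(EnglishGreeter(), Greeter)
rejects = isinstance(NotAGreeter(), Greeter)
```

Because the runtime check only looks for member names, a static type checker is still the better tool for catching signature mismatches.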
How components are bound
Every swappable component can be supplied two ways:
- By object: pass an instance in code, at construction or via DirectoryPipeline.use(...).
- By name: pass a short registry string (e.g. "docling", "qdrant"), optionally with a :model suffix (e.g. "ollama:qwen2.5"), set in code, on the CLI, or in indx.toml.
Both routes converge on the same typed protocol. Name resolution is covered in Resolution by name below and in Registry and defaults.
The Stage protocol
A Stage is a single unit of work in a DirectoryPipeline. The pipeline executes stages in registration order, passing one shared SpaceContext from each stage to the next.

    from typing import Protocol, runtime_checkable

    @runtime_checkable
    class Stage(Protocol):
        """A single unit of work in a DirectoryPipeline.

        A stage MUST return the same SpaceContext instance it received, mutated
        in place. The pipeline executes stages in registration order, passing
        the context from each stage to the next.
        """

        name: str  # stable identifier, e.g. "walk", "parse", "chunk", "relate", "enrich", "embed-pack"

        def run(self, ctx: SpaceContext) -> SpaceContext: ...

| Member | Contract |
|---|---|
| name | A stable string identifier ("walk", "parse", "chunk", "relate", "enrich", "embed-pack"). Stages are addressed by name for replace(name, ...) and drop(name). |
| run(ctx) | Does the stage’s work and returns the same SpaceContext instance, mutated in place; never a new object and never None. |
The built-in stages are not registered in any per-slot component registry; they are wired by DirectoryPipeline directly. Third-party stages are discovered through the indx.stages entry-point group. See custom stages for the full authoring recipe and the pipeline reference for stage ordering.
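The Stage contract above can be sketched with stand-in types. In this example `SpaceContext` is a simplified placeholder (the real class and its fields are not shown on this page), and `CountFiles` and `run_stages` are hypothetical names; the runner only illustrates the "same instance, mutated in place" rule and is not how DirectoryPipeline is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class SpaceContext:
    """Stand-in for indx's SpaceContext; these fields are assumptions."""

    files: list[str] = field(default_factory=list)
    notes: dict[str, int] = field(default_factory=dict)


class CountFiles:
    """Hypothetical stage that records how many files the context holds."""

    name = "count-files"

    def run(self, ctx: SpaceContext) -> SpaceContext:
        ctx.notes["file_count"] = len(ctx.files)  # mutate in place
        return ctx  # return the same instance, never a copy or None


def run_stages(stages: list, ctx: SpaceContext) -> SpaceContext:
    """Toy runner: executes stages in order and enforces the identity rule."""
    for stage in stages:
        result = stage.run(ctx)
        if result is not ctx:
            raise RuntimeError(f"stage {stage.name!r} must return the context it received")
    return ctx


ctx = SpaceContext(files=["a.md", "b.pdf"])
out = run_stages([CountFiles()], ctx)
```

Checking identity with `is` rather than equality is deliberate: a stage that returns an equal copy would still break later stages that hold a reference to the original context.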
The Parser protocol
Converts a single file into a ParsedDoc. Called once per file by stage 02 (Parse).

    @runtime_checkable
    class Parser(Protocol):
        """Converts a single file into a ParsedDoc. Default: docling."""

        def parse(self, file: "FileRef") -> ParsedDoc: ...

| Member | Contract |
|---|---|
| parse(file) | Takes a FileRef and returns a fully populated ParsedDoc: normalized text, plus structural blocks, tables, and images that the Chunk stage uses to split with structure intact. |
FileRef is the lightweight handle produced by stage 01 (Walk) and passed to the parser. It carries the resolved path, a bytes accessor, the detected MIME/type, and the folder lineage. In practice a parser reads the file through it (for example file.read_text() or file.path, file.folder) and emits a ParsedDoc whose Source records path, folder, and type.
- Default: docling → DoclingParser.
- Registry name(s): docling (plus any installed adapters, e.g. unstructured, llamaparse, markitdown, and the zero-dependency plaintext fallback that ships in core).
See Choosing a parser for the trade-offs between engines.
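To make the Parser contract concrete, here is a minimal sketch of a paragraph-splitting parser. `FileRef` and `ParsedDoc` below are reduced stand-ins whose field names are assumptions (only `path`, `folder`, and `read_text()` are mentioned above), and `ParagraphParser` is a hypothetical class, not the plaintext fallback that ships in core.

```python
from dataclasses import dataclass, field


@dataclass
class FileRef:
    """Stand-in for indx's FileRef, reduced to what this sketch needs."""

    path: str
    folder: str
    content: str

    def read_text(self) -> str:
        return self.content


@dataclass
class ParsedDoc:
    """Stand-in for indx's ParsedDoc; field names here are assumptions."""

    text: str
    source: dict
    blocks: list[str] = field(default_factory=list)


class ParagraphParser:
    """Hypothetical parser: normalizes line endings, splits on blank lines."""

    def parse(self, file: FileRef) -> ParsedDoc:
        text = file.read_text().replace("\r\n", "\n").strip()
        blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
        source = {"path": file.path, "folder": file.folder, "type": "text/plain"}
        return ParsedDoc(text=text, source=source, blocks=blocks)


doc = ParagraphParser().parse(
    FileRef(path="notes/todo.txt", folder="notes", content="First.\r\n\r\nSecond.\n")
)
```

The parser keeps both the normalized text and the block list so that a later chunking step could split on block boundaries instead of raw character offsets.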
The LLM protocol
Text generation used by stage 05 (Enrich) to derive document type, topics, tags, and summaries.

    @runtime_checkable
    class LLM(Protocol):
        """Text generation for enrichment (type, topics, tags, summaries). Default: openai:gpt-5-mini."""

        def complete(
            self,
            prompt: str,
            *,
            system: str | None = None,
            max_tokens: int = 512,
            temperature: float = 0.0,
        ) -> str: ...

| Parameter | Default | Meaning |
|---|---|---|
| prompt | (required) | The user prompt to complete. |
| system | None | Optional system instruction. |
| max_tokens | 512 | Upper bound on generated tokens. |
| temperature | 0.0 | Sampling temperature; the default of 0.0 keeps enrichment deterministic. |
complete(...) returns the generated text as a plain str.
- Default: openai:gpt-5-mini → OpenAILLM running the gpt-5-mini model; use ollama:qwen2.5 for local, no-egress enrichment.
- Registry name(s): ollama:<model>, none (and installed cloud adapters such as openai, anthropic, azure, vllm).
Enrichment can be turned off entirely with --llm none, by dropping the enrich stage, or via [enrich] llm = "none" in indx.toml. See Enrichment with LLM and VLM.
The VLM protocol
Vision-language enrichment for images and figures, also used by stage 05 (Enrich). Disabled by default.

    @runtime_checkable
    class VLM(Protocol):
        """Vision-language enrichment for images/layout. Default: none (disabled)."""

        def describe(self, image: bytes, *, prompt: str | None = None) -> str: ...

| Parameter | Default | Meaning |
|---|---|---|
| image | (required) | The raw image bytes to describe. |
| prompt | None | Optional instruction steering the description. |
describe(...) returns a textual description as a str.
- Default: none → NullVLM, a no-op that ships in core and skips vision enrichment.
- Registry name(s): none, plus installed adapters such as qwen-vl, gpt-4o, vlm-local.
Because the default is none, image understanding is strictly opt-in. Enable it with --vlm <adapter> or [enrich] vlm = "<adapter>". See Enrichment with LLM and VLM.
The Embedder protocol
Turns chunk text into vectors. Driven by stage 06 (Embed+Pack), in batches.

    @runtime_checkable
    class Embedder(Protocol):
        """Turns text into vectors. Default: openai:text-embedding-3-small."""

        dim: int

        def embed(self, texts: list[str]) -> list[list[float]]: ...

| Member | Contract |
|---|---|
| dim | The vector dimensionality (an int). Pinned into the archive manifest so a loaded archive can validate query-time compatibility. |
| embed(texts) | Accepts a list[str] and returns a list[list[float]]: one vector per input, each of length dim, in the same order. |
- Default: openai:text-embedding-3-small → OpenAIEmbedder, dimension 1536; use bge-m3 for the local profile.
- Registry name(s): bge-m3 (and installed adapters such as e5, openai, cohere).
The Embed+Pack stage groups chunk texts into batches (default 64) before each embed(...) call. See Choosing an embedder.
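The embedder contract and the batching behaviour can be sketched together. `HashingEmbedder` and `embed_in_batches` are hypothetical names: the embedder uses a toy hashed bag-of-words scheme (not a real embedding model), and the helper only approximates how a stage might slice texts into batches of 64.

```python
import hashlib
import math


class HashingEmbedder:
    """Hypothetical embedder: hashes each token into one of `dim` buckets."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def embed(self, texts: list[str]) -> list[list[float]]:
        return [self._embed_one(text) for text in texts]

    def _embed_one(self, text: str) -> list[float]:
        vec = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[digest[0] % self.dim] += 1.0
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid dividing by zero
        return [x / norm for x in vec]


def embed_in_batches(embedder, texts: list[str], batch_size: int = 64) -> list[list[float]]:
    """Toy version of the batching step: one embed() call per slice."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embedder.embed(texts[start:start + batch_size]))
    return vectors


texts = [f"chunk number {i}" for i in range(130)]
vectors = embed_in_batches(HashingEmbedder(dim=8), texts)
```

With 130 inputs and a batch size of 64, the helper makes three `embed(...)` calls (64, 64, and 2 texts) while preserving input order, which is the property the contract requires.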
The Store protocol
The vector-database adapter. Used by stage 06 to persist vectors and to back space.search(...).

    @runtime_checkable
    class Store(Protocol):
        """Vector database adapter. Default: qdrant. Also: pgvector, chroma, lancedb, jsonl."""

        def upsert(self, ids: list[str], vectors: list[list[float]], payloads: list[dict]) -> None: ...

        def query(self, vector: list[float], k: int = 5, filter: dict | None = None) -> list[tuple[str, float]]: ...

        def persist(self, dest: str) -> None:
            """Flush/export the store into the output `embeddings/` layout."""

| Method | Contract |
|---|---|
| upsert(ids, vectors, payloads) | Inserts or updates vectors. The three lists are positionally aligned: ids[i] (chunk id), vectors[i] (its embedding), and payloads[i] (its metadata dict). Returns None. |
| query(vector, k=5, filter=None) | Returns the top-k matches as a list[tuple[str, float]] of (id, score), highest score first. An optional filter dict restricts the search (for example by document type). |
| persist(dest) | Flushes/exports the store into the output embeddings/ layout so the resulting .indx archive is portable regardless of backend. |
- Default: qdrant → QdrantStore.
- Registry name(s): qdrant, pgvector, chroma, lancedb, jsonl (the zero-dependency jsonl fallback ships in core and enables a fully offline run).
See Choosing a store and the .indx archive reference.
The OutputWriter protocol
Serializes a finished KnowledgeSpace to disk. The final responsibility of stage 06.

    @runtime_checkable
    class OutputWriter(Protocol):
        """Serializes a KnowledgeSpace to disk. Default: .indx. Also: jsonl, langchain, llamaindex."""

        format: str

        def write(self, space: KnowledgeSpace, out: str) -> None: ...

| Member | Contract |
|---|---|
| format | A string naming the output format this writer emits (for example ".indx"). |
| write(space, out) | Serializes the KnowledgeSpace to the output directory out. Returns None. |
- Default: .indx → IndxWriter, which seals the portable archive.
- Registry name(s): .indx, jsonl, langchain, llamaindex (the jsonl writer ships in core).
See Output formats for what each writer emits.
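A writer only needs a `format` attribute and a `write(space, out)` method. The sketch below uses a simplified stand-in for `KnowledgeSpace` (its real structure is not described on this page) and a hypothetical `JsonSummaryWriter` with an invented format name and file name, purely to show the shape.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class KnowledgeSpace:
    """Stand-in for indx's KnowledgeSpace; the real structure is richer."""

    name: str
    chunks: list[dict] = field(default_factory=list)


class JsonSummaryWriter:
    """Hypothetical writer that emits a single space.json file."""

    format = "json-summary"

    def write(self, space: KnowledgeSpace, out: str) -> None:
        out_dir = Path(out)
        out_dir.mkdir(parents=True, exist_ok=True)
        target = out_dir / "space.json"
        target.write_text(json.dumps(asdict(space), indent=2), encoding="utf-8")


# Usage: write into a temporary directory and read the result back.
with tempfile.TemporaryDirectory() as tmp:
    space = KnowledgeSpace("demo", [{"id": "c1", "text": "hello"}])
    JsonSummaryWriter().write(space, tmp)
    written = json.loads((Path(tmp) / "space.json").read_text(encoding="utf-8"))
```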
Resolution by name
Each component sub-package maintains a registry mapping a short name to a class. A name may carry an optional :model suffix; the base name selects the adapter class and the suffix selects the model.

    REGISTRY = {"ollama": OllamaLLM, "none": NullLLM}

    def resolve(name: str) -> LLM:
        base, _, model = name.partition(":")
        cls = REGISTRY[base]  # raises StageError(kind="fatal") if missing
        return cls(model=model or None)

Slots, names, and defaults
| Slot | Name strings | Class (default) |
|---|---|---|
| parser | docling | DoclingParser |
| llm | openai:<model>, ollama:<model>, none | OpenAILLM (gpt-5-mini) |
| vlm | none, <adapter> | NullVLM |
| embedder | openai:text-embedding-3-small, bge-m3 (plus adapters e5, cohere) | OpenAIEmbedder (dim 1536) |
| store | qdrant, pgvector, chroma, lancedb, jsonl | QdrantStore |
| output | .indx, jsonl, langchain, llamaindex | IndxWriter |
Resolution order
For each unset slot, the effective component is chosen in this priority order:

    explicit object/string > indx.toml > documented default

An explicit instance or name passed in code (or to use()) wins; otherwise indx.toml is consulted; otherwise the documented default applies.
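The priority order amounts to "first value that is set wins". A minimal sketch, assuming unset values are represented as `None`; `pick`, `config`, and `defaults` are hypothetical names, and `config` merely mimics what might be read from indx.toml.

```python
def pick(explicit, configured, default):
    """Return the first candidate that is set, in priority order."""
    for candidate in (explicit, configured, default):
        if candidate is not None:
            return candidate
    raise ValueError("no component available for this slot")


# Hypothetical values: `config` mimics settings parsed from indx.toml.
config = {"store": "jsonl"}
defaults = {"store": "qdrant", "parser": "docling"}

store = pick("chroma", config.get("store"), defaults["store"])  # explicit wins
store_cfg = pick(None, config.get("store"), defaults["store"])  # falls back to config
parser = pick(None, config.get("parser"), defaults["parser"])  # falls back to default
```

Note that the string `"none"` is itself a set value in this scheme: it selects the no-op component rather than falling through to the next level.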
Third-party adapters register additional names through entry-point groups (indx.parsers, indx.llms, indx.vlms, indx.embedders, indx.stores, indx.outputs, indx.stages); once installed, an adapter is usable by name anywhere a built-in is. First-party builtins win on name collisions. For the complete registry mechanics see Registry and defaults and Authoring a plugin.