Component Protocols

indx is built on structural typing: every replaceable part of the pipeline is a Python typing.Protocol with a default implementation, and any object that satisfies the protocol’s shape can be dropped straight in. This page is the authoritative contract for the seven protocols — Stage, Parser, LLM, VLM, Embedder, Store, and OutputWriter — plus the rules for resolving a component by name.

There is no base class to inherit from. To provide your own component you only need to match the method signatures shown below. See Bring your own components for worked examples and Authoring a plugin for packaging one for distribution.

Every swappable component can be supplied two ways:

  • By object — pass an instance in code, at construction or via DirectoryPipeline.use(...).
  • By name — a short registry string (e.g. "docling", "qdrant"), optionally with a :model suffix (e.g. "ollama:qwen2.5"), set in code, on the CLI, or in indx.toml.

Both routes converge on the same typed protocol. Name resolution is covered in Resolution by name below and in Registry and defaults.

Stage

A Stage is a single unit of work in a DirectoryPipeline. The pipeline executes stages in registration order, passing one shared SpaceContext from each stage to the next.

    from typing import Protocol, runtime_checkable

    @runtime_checkable
    class Stage(Protocol):
        """A single unit of work in a DirectoryPipeline.

        A stage MUST return the same SpaceContext instance it received, mutated in
        place. The pipeline executes stages in registration order, passing the
        context from each stage to the next.
        """

        name: str  # stable identifier, e.g. "walk", "parse", "chunk", "relate", "enrich", "embed-pack"

        def run(self, ctx: SpaceContext) -> SpaceContext: ...

| Member | Contract |
| --- | --- |
| name | A stable string identifier ("walk", "parse", "chunk", "relate", "enrich", "embed-pack"). Stages are addressed by name for replace(name, ...) and drop(name). |
| run(ctx) | Does the stage’s work and returns the same SpaceContext instance, mutated in place: never a new object and never None. |

The built-in stages are not registered in any per-slot component registry; they are wired by DirectoryPipeline directly. Third-party stages are discovered through the indx.stages entry-point group. See custom stages for the full authoring recipe and the pipeline reference for stage ordering.
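To make the contract concrete, here is a minimal sketch of a custom stage and of a loop that enforces the "same instance" rule. The Stage protocol is copied from above. Everything else is an assumption for illustration: SpaceContext is a simplified stand-in (its real fields are not documented on this page), and CountFiles, its "count-files" name, and run_stages are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable


@dataclass
class SpaceContext:
    """Stand-in for indx's SpaceContext; these fields are assumptions."""
    files: list[str] = field(default_factory=list)
    notes: dict[str, int] = field(default_factory=dict)


@runtime_checkable
class Stage(Protocol):
    name: str

    def run(self, ctx: SpaceContext) -> SpaceContext: ...


class CountFiles:
    """Hypothetical stage: records how many files the context holds."""
    name = "count-files"

    def run(self, ctx: SpaceContext) -> SpaceContext:
        ctx.notes["file_count"] = len(ctx.files)  # mutate in place
        return ctx  # return the same instance, never a copy or None


def run_stages(stages: list[Stage], ctx: SpaceContext) -> SpaceContext:
    """Run stages in registration order, threading one shared context."""
    for stage in stages:
        result = stage.run(ctx)
        if result is not ctx:
            raise RuntimeError(f"stage {stage.name!r} did not return the context it received")
    return ctx


ctx = SpaceContext(files=["a.md", "b.pdf"])
stage = CountFiles()
out = run_stages([stage], ctx)
```

CountFiles inherits from nothing. It passes isinstance(stage, Stage) only because it has a name attribute and a run method, which is the structural-typing point made at the top of this page.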

Parser

A Parser converts a single file into a ParsedDoc. It is called once per file by stage 02 (Parse).

    @runtime_checkable
    class Parser(Protocol):
        """Converts a single file into a ParsedDoc. Default: docling."""

        def parse(self, file: "FileRef") -> ParsedDoc: ...

| Member | Contract |
| --- | --- |
| parse(file) | Takes a FileRef and returns a fully populated ParsedDoc: normalized text plus the structural blocks, tables, and images that the Chunk stage uses to split with structure intact. |

FileRef is the lightweight handle produced by stage 01 (Walk) and passed to the parser. It carries the resolved path, a bytes accessor, the detected MIME/type, and the folder lineage. In practice a parser reads the file through this handle (for example via file.read_text(), file.path, or file.folder) and emits a ParsedDoc whose Source records path, folder, and type.

  • Default: docling, which resolves to DoclingParser.
  • Registry name(s): docling, plus the zero-dependency plaintext fallback that ships in core and any installed adapters (e.g. unstructured, llamaparse, markitdown).

See Choosing a parser for the trade-offs between engines.
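As an illustration of the parse(file) contract, the sketch below implements a tiny paragraph-splitting parser. FileRef, Source, and ParsedDoc here are simplified stand-ins: their field names follow the description above (path, folder, type, read_text()), but the exact shapes in indx are assumptions, and ParagraphParser is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable


@dataclass
class FileRef:
    """Stand-in for the handle produced by Walk; fields are assumptions."""
    path: str
    folder: str
    type: str
    content: bytes

    def read_text(self, encoding: str = "utf-8") -> str:
        return self.content.decode(encoding, errors="replace")


@dataclass
class Source:
    path: str
    folder: str
    type: str


@dataclass
class ParsedDoc:
    """Stand-in for indx's ParsedDoc; only a subset of plausible fields."""
    source: Source
    text: str
    blocks: list[str] = field(default_factory=list)


@runtime_checkable
class Parser(Protocol):
    def parse(self, file: FileRef) -> ParsedDoc: ...


class ParagraphParser:
    """Hypothetical parser: treats blank-line-separated runs as blocks."""

    def parse(self, file: FileRef) -> ParsedDoc:
        # Normalize line endings so paragraph splitting is platform-independent.
        text = file.read_text().replace("\r\n", "\n").strip()
        blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
        return ParsedDoc(Source(file.path, file.folder, file.type), text, blocks)


ref = FileRef("notes/todo.txt", "notes", "text/plain",
              b"First para.\r\n\r\nSecond para.\n")
doc = ParagraphParser().parse(ref)
```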

LLM

An LLM provides the text generation that stage 05 (Enrich) uses to derive document type, topics, tags, and summaries.

    @runtime_checkable
    class LLM(Protocol):
        """Text generation for enrichment (type, topics, tags, summaries).

        Default: openai:gpt-5-mini."""

        def complete(self, prompt: str, *, system: str | None = None,
                     max_tokens: int = 512, temperature: float = 0.0) -> str: ...

| Parameter | Default | Meaning |
| --- | --- | --- |
| prompt | (required) | The user prompt to complete. |
| system | None | Optional system instruction. |
| max_tokens | 512 | Upper bound on generated tokens. |
| temperature | 0.0 | Sampling temperature; the default of 0.0 keeps enrichment deterministic. |

complete(...) returns the generated text as a plain str.

  • Default: openai:gpt-5-mini, which resolves to OpenAILLM running the gpt-5-mini model; use ollama:qwen2.5 for local, no-egress enrichment.
  • Registry name(s): ollama:<model>, none (and installed cloud adapters such as openai, anthropic, azure, vllm).

Enrichment can be turned off entirely with --llm none, by dropping the enrich stage, or via [enrich] llm = "none" in indx.toml. See Enrichment with LLM and VLM.
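Because the protocol is a single method, a test double is easy to write. The sketch below is one possible stand-in for exercising enrichment code without a model. CannedLLM is hypothetical, and its word-based truncation is only a crude approximation of a token limit, not how any real adapter counts tokens.

```python
from __future__ import annotations

from typing import Protocol, runtime_checkable


@runtime_checkable
class LLM(Protocol):
    def complete(self, prompt: str, *, system: str | None = None,
                 max_tokens: int = 512, temperature: float = 0.0) -> str: ...


class CannedLLM:
    """Hypothetical test double: returns a fixed reply and records calls."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.calls: list[dict] = []

    def complete(self, prompt: str, *, system: str | None = None,
                 max_tokens: int = 512, temperature: float = 0.0) -> str:
        self.calls.append({"prompt": prompt, "system": system,
                           "max_tokens": max_tokens, "temperature": temperature})
        # Crude stand-in for a token limit: one whitespace-separated word per token.
        return " ".join(self.reply.split()[:max_tokens])


llm = CannedLLM("invoice, finance, 2024")
tags = llm.complete("Suggest tags for this document.",
                    system="Reply with a comma-separated list.")
short = llm.complete("Suggest tags.", max_tokens=2)
```

Note that system, max_tokens, and temperature are keyword-only in the protocol, so a conforming implementation should keep the bare `*` in its signature.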

VLM

A VLM provides vision-language enrichment for images and figures. Like the LLM, it is used by stage 05 (Enrich). It is disabled by default.

    @runtime_checkable
    class VLM(Protocol):
        """Vision-language enrichment for images/layout. Default: none (disabled)."""

        def describe(self, image: bytes, *, prompt: str | None = None) -> str: ...

| Parameter | Default | Meaning |
| --- | --- | --- |
| image | (required) | The raw image bytes to describe. |
| prompt | None | Optional instruction steering the description. |

describe(...) returns a textual description as a str.

  • Default: none, which resolves to NullVLM, a no-op that ships in core and skips vision enrichment.
  • Registry name(s): none, plus installed adapters such as qwen-vl, gpt-4o, vlm-local.

Because the default is none, image understanding is strictly opt-in. Enable it with --vlm <adapter> or [enrich] vlm = "<adapter>". See Enrichment with LLM and VLM.
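The describe(...) signature can be satisfied without any model at all, which is handy when wiring up a pipeline before choosing an adapter. The sketch below is a hypothetical FormatSnifferVLM that only inspects well-known file signatures (PNG, JPEG, GIF). It ignores prompt, whereas a real adapter would presumably forward it to the model.

```python
from __future__ import annotations

from typing import Protocol, runtime_checkable


@runtime_checkable
class VLM(Protocol):
    def describe(self, image: bytes, *, prompt: str | None = None) -> str: ...


class FormatSnifferVLM:
    """Hypothetical adapter: names the container format instead of calling a model."""

    # Leading "magic" bytes of common image formats.
    _MAGIC = {
        b"\x89PNG\r\n\x1a\n": "PNG",
        b"\xff\xd8\xff": "JPEG",
        b"GIF8": "GIF",
    }

    def describe(self, image: bytes, *, prompt: str | None = None) -> str:
        kind = next((name for magic, name in self._MAGIC.items()
                     if image.startswith(magic)), "unknown")
        return f"{kind} image, {len(image)} bytes"


vlm = FormatSnifferVLM()
png = vlm.describe(b"\x89PNG\r\n\x1a\n" + b"\x00" * 4)
other = vlm.describe(b"not an image", prompt="Describe the chart.")
```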

Embedder

An Embedder turns chunk text into vectors. It is driven by stage 06 (Embed+Pack), which calls it in batches.

    @runtime_checkable
    class Embedder(Protocol):
        """Turns text into vectors. Default: openai:text-embedding-3-small."""

        dim: int

        def embed(self, texts: list[str]) -> list[list[float]]: ...

| Member | Contract |
| --- | --- |
| dim | The vector dimensionality (an int). Pinned into the archive manifest so a loaded archive can validate query-time compatibility. |
| embed(texts) | Accepts a list[str] and returns a list[list[float]] with one vector per input, each of length dim, in the same order. |

  • Default: openai:text-embedding-3-small, which resolves to OpenAIEmbedder with dimension 1536; use bge-m3 for the local profile.
  • Registry name(s): bge-m3 (and installed adapters such as e5, openai, cohere).

The Embed+Pack stage groups chunk texts into batches (default 64) before each embed(...) call. See Choosing an embedder.
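The batching behaviour and the "one vector per input, length dim, same order" contract can be sketched as follows. This is not the Embed+Pack implementation: HashEmbedder is a hypothetical toy (a hashed bag-of-words with no semantic quality), and embed_all is an assumed helper that mimics the documented default batch size of 64 while checking the contract on each batch.

```python
import hashlib
import math
from typing import Iterator, Protocol, runtime_checkable


@runtime_checkable
class Embedder(Protocol):
    dim: int

    def embed(self, texts: list[str]) -> list[list[float]]: ...


class HashEmbedder:
    """Hypothetical toy embedder: hashed bag-of-words, L2-normalized."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            vec = [0.0] * self.dim
            for word in text.lower().split():
                digest = hashlib.sha256(word.encode("utf-8")).digest()
                vec[digest[0] % self.dim] += 1.0  # bucket chosen by first hash byte
            norm = math.sqrt(sum(v * v for v in vec)) or 1.0
            vectors.append([v / norm for v in vec])
        return vectors


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(embedder: Embedder, texts: list[str],
              batch_size: int = 64) -> list[list[float]]:
    out: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors = embedder.embed(batch)
        if len(vectors) != len(batch) or any(len(v) != embedder.dim for v in vectors):
            raise ValueError("embedder broke the one-vector-per-input contract")
        out.extend(vectors)  # order is preserved batch by batch
    return out


texts = [f"chunk number {i}" for i in range(150)]
vectors = embed_all(HashEmbedder(dim=8), texts)  # batches of 64, 64, and 22
```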

Store

A Store is the vector-database adapter. Stage 06 uses it to persist vectors, and it backs space.search(...).

    @runtime_checkable
    class Store(Protocol):
        """Vector database adapter. Default: qdrant.

        Also: pgvector, chroma, lancedb, jsonl."""

        def upsert(self, ids: list[str], vectors: list[list[float]],
                   payloads: list[dict]) -> None: ...

        def query(self, vector: list[float], k: int = 5,
                  filter: dict | None = None) -> list[tuple[str, float]]: ...

        def persist(self, dest: str) -> None:
            """Flush/export the store into the output `embeddings/` layout."""

| Method | Contract |
| --- | --- |
| upsert(ids, vectors, payloads) | Inserts or updates vectors. The three lists are positionally aligned: ids[i] (chunk id), vectors[i] (its embedding), and payloads[i] (its metadata dict). Returns None. |
| query(vector, k=5, filter=None) | Returns the top-k matches as a list[tuple[str, float]] of (id, score), highest score first. An optional filter dict restricts the search (for example by document type). |
| persist(dest) | Flushes/exports the store into the output embeddings/ layout so the resulting .indx archive is portable regardless of backend. |

  • Default: qdrant, which resolves to QdrantStore.
  • Registry name(s): qdrant, pgvector, chroma, lancedb, jsonl (the zero-dependency jsonl fallback ships in core and enables a fully offline run).

See Choosing a store and the .indx archive reference.
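A brute-force, in-memory store is enough to show how the three methods fit together. The sketch below is hypothetical and not one of indx's adapters. Cosine similarity as the score, exact-match filter semantics, and the vectors.jsonl file name are all assumptions, because the filter format and the actual embeddings/ layout are not specified on this page.

```python
from __future__ import annotations

import json
import math
import tempfile
from pathlib import Path


class MemoryStore:
    """Hypothetical in-memory store with brute-force cosine search."""

    def __init__(self) -> None:
        self._rows: dict[str, tuple[list[float], dict]] = {}

    def upsert(self, ids: list[str], vectors: list[list[float]],
               payloads: list[dict]) -> None:
        if not (len(ids) == len(vectors) == len(payloads)):
            raise ValueError("ids, vectors, and payloads must be the same length")
        for id_, vec, payload in zip(ids, vectors, payloads):
            self._rows[id_] = (vec, payload)  # same id overwrites: insert-or-update

    def query(self, vector: list[float], k: int = 5,
              filter: dict | None = None) -> list[tuple[str, float]]:
        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        hits = [
            (id_, cosine(vector, vec))
            for id_, (vec, payload) in self._rows.items()
            # Assumed filter semantics: every key must match the payload exactly.
            if not filter or all(payload.get(key) == val for key, val in filter.items())
        ]
        hits.sort(key=lambda hit: hit[1], reverse=True)  # highest score first
        return hits[:k]

    def persist(self, dest: str) -> None:
        # Assumed layout: one JSON object per line in <dest>/vectors.jsonl.
        out = Path(dest)
        out.mkdir(parents=True, exist_ok=True)
        with open(out / "vectors.jsonl", "w", encoding="utf-8") as fh:
            for id_, (vec, payload) in self._rows.items():
                fh.write(json.dumps({"id": id_, "vector": vec, "payload": payload}) + "\n")


store = MemoryStore()
store.upsert(["a", "b", "c"],
             [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]],
             [{"type": "memo"}, {"type": "invoice"}, {"type": "memo"}])
top = store.query([1.0, 0.0], k=2)
memos = store.query([0.0, 1.0], filter={"type": "memo"})
with tempfile.TemporaryDirectory() as tmp:
    store.persist(tmp)
    lines = (Path(tmp) / "vectors.jsonl").read_text(encoding="utf-8").splitlines()
```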

OutputWriter

An OutputWriter serializes a finished KnowledgeSpace to disk. This is the final responsibility of stage 06.

    @runtime_checkable
    class OutputWriter(Protocol):
        """Serializes a KnowledgeSpace to disk. Default: .indx.

        Also: jsonl, langchain, llamaindex."""

        format: str

        def write(self, space: KnowledgeSpace, out: str) -> None: ...

| Member | Contract |
| --- | --- |
| format | A string naming the output format this writer emits (for example ".indx"). |
| write(space, out) | Serializes the KnowledgeSpace to the output directory out. Returns None. |

  • Default: .indx, which resolves to IndxWriter, the writer that seals the portable archive.
  • Registry name(s): .indx, jsonl, langchain, llamaindex (the jsonl writer ships in core).

See Output formats for what each writer emits.
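A custom writer only needs a format attribute and a write method. The sketch below is a guess at the smallest useful shape. KnowledgeSpace is a stand-in with a single chunks field (the real class is not documented here), and ChunkListWriter, its "chunk-list" format string, and the chunks.jsonl file name are hypothetical.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Protocol, runtime_checkable


@dataclass
class KnowledgeSpace:
    """Stand-in; the real KnowledgeSpace fields are not documented here."""
    chunks: list[dict] = field(default_factory=list)


@runtime_checkable
class OutputWriter(Protocol):
    format: str

    def write(self, space: KnowledgeSpace, out: str) -> None: ...


class ChunkListWriter:
    """Hypothetical writer: one JSON object per chunk in <out>/chunks.jsonl."""
    format = "chunk-list"

    def write(self, space: KnowledgeSpace, out: str) -> None:
        target = Path(out)
        target.mkdir(parents=True, exist_ok=True)  # `out` is a directory
        with open(target / "chunks.jsonl", "w", encoding="utf-8") as fh:
            for chunk in space.chunks:
                fh.write(json.dumps(chunk, ensure_ascii=False) + "\n")


space = KnowledgeSpace(chunks=[{"id": "c1", "text": "hello"},
                               {"id": "c2", "text": "wörld"}])
writer = ChunkListWriter()
with tempfile.TemporaryDirectory() as tmp:
    writer.write(space, tmp)
    rows = [json.loads(line) for line in
            (Path(tmp) / "chunks.jsonl").read_text(encoding="utf-8").splitlines()]
```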

Resolution by name

Each component sub-package maintains a registry mapping a short name to a class. A name may carry an optional :model suffix; the base name selects the adapter class and the suffix selects the model.

    # indx/llm/__init__.py
    REGISTRY = {"ollama": OllamaLLM, "none": NullLLM}

    def resolve(name: str) -> LLM:
        base, _, model = name.partition(":")
        try:
            cls = REGISTRY[base]
        except KeyError:
            # an unknown base name is reported as a fatal StageError
            raise StageError(kind="fatal") from None
        return cls(model=model or None)

| Slot | Name strings | Class (default) |
| --- | --- | --- |
| parser | docling | DoclingParser |
| llm | openai:<model>, ollama:<model>, none | OpenAILLM (gpt-5-mini) |
| vlm | none, <adapter> | NullVLM |
| embedder | openai:text-embedding-3-small, bge-m3 (plus adapters e5, cohere) | OpenAIEmbedder (dim 1536) |
| store | qdrant, pgvector, chroma, lancedb, jsonl | QdrantStore |
| output | .indx, jsonl, langchain, llamaindex | IndxWriter |
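The name-plus-suffix rule can be tried in isolation with stand-in classes. In the sketch below, OllamaLLM, NullLLM, and StageError are local placeholders, not the real indx classes. In particular, the StageError constructor (an optional message plus a kind keyword) is an assumption based only on the kind="fatal" usage shown above.

```python
from __future__ import annotations


class StageError(Exception):
    """Stand-in for indx's StageError; the constructor shape is an assumption."""

    def __init__(self, message: str = "", *, kind: str = "fatal") -> None:
        super().__init__(message)
        self.kind = kind


class OllamaLLM:
    """Placeholder adapter that only remembers which model was requested."""

    def __init__(self, model: str | None = None) -> None:
        self.model = model


class NullLLM:
    """Placeholder for the 'none' adapter."""

    def __init__(self, model: str | None = None) -> None:
        self.model = model


REGISTRY = {"ollama": OllamaLLM, "none": NullLLM}


def resolve(name: str):
    # partition() splits on the FIRST colon only, so the model part may
    # itself contain colons (e.g. a tag such as "qwen2.5:7b").
    base, _, model = name.partition(":")
    try:
        cls = REGISTRY[base]
    except KeyError:
        known = ", ".join(sorted(REGISTRY))
        raise StageError(f"unknown llm {base!r}; known names: {known}",
                         kind="fatal") from None
    return cls(model=model or None)  # empty suffix becomes None


with_model = resolve("ollama:qwen2.5")
tagged = resolve("ollama:qwen2.5:7b")
bare = resolve("none")
```

Passing model=None for a bare name leaves the choice of model to the adapter class, which matches the `model or None` expression in the snippet above.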

For each slot, the effective component is chosen in this priority order:

explicit object/string > indx.toml > documented default

An explicit instance or name passed in code (or to use()) wins; otherwise indx.toml is consulted; otherwise the documented default applies.
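The priority order amounts to a first-match lookup. The sketch below assumes the indx.toml values have already been read into a flat slot-to-name dict (how the file's tables map onto slots is not covered here). effective_component is a hypothetical helper, and DEFAULTS simply restates the defaults from the table above.

```python
from __future__ import annotations

from typing import Any

# Documented defaults, restated from the table above.
DEFAULTS = {
    "parser": "docling",
    "llm": "openai:gpt-5-mini",
    "vlm": "none",
    "embedder": "openai:text-embedding-3-small",
    "store": "qdrant",
    "output": ".indx",
}


def effective_component(slot: str, explicit: Any = None,
                        toml_values: dict[str, str] | None = None) -> Any:
    """Return the explicit object/string, else the indx.toml value, else the default."""
    if explicit is not None:
        return explicit  # an instance or a name passed in code wins
    configured = (toml_values or {}).get(slot)
    if configured is not None:
        return configured  # otherwise indx.toml is consulted
    return DEFAULTS[slot]  # otherwise the documented default applies


toml_values = {"store": "jsonl", "llm": "ollama:qwen2.5"}
from_default = effective_component("parser", toml_values=toml_values)
from_toml = effective_component("store", toml_values=toml_values)
from_code = effective_component("store", explicit="chroma", toml_values=toml_values)
```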

Third-party adapters register additional names through entry-point groups (indx.parsers, indx.llms, indx.vlms, indx.embedders, indx.stores, indx.outputs, indx.stages); once installed, an adapter is usable by name anywhere a built-in is. First-party builtins win on name collisions. For the complete registry mechanics see Registry and defaults and Authoring a plugin.