Component Protocols

indx is built on structural typing: every replaceable part of the pipeline is a Python typing.Protocol with a default implementation, and any object that satisfies the protocol’s shape can be dropped straight in. This page is the authoritative contract for the seven protocols — Stage, Parser, LLM, VLM, Embedder, Store, and OutputWriter — plus the rules for resolving a component by name.

There is no base class to inherit from. To provide your own component you only need to match the method signatures shown below. See Bring your own components for worked examples and Authoring a plugin for packaging one for distribution.
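As a minimal sketch of what structural typing means in practice, the snippet below declares a small stand-in protocol and checks two unrelated classes against it. `Greeter`, `PlainGreeter`, and `NotAGreeter` are hypothetical names invented for this illustration and are not part of indx.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Greeter(Protocol):
    """Hypothetical stand-in protocol, used only for this illustration."""

    def greet(self, who: str) -> str: ...


class PlainGreeter:
    """Does not inherit from Greeter; it only matches the method shape."""

    def greet(self, who: str) -> str:
        return f"hello, {who}"


class NotAGreeter:
    """Lacks a greet() method, so it does not satisfy the protocol."""


# isinstance() against a runtime_checkable protocol checks that the members
# exist; it does not verify parameter or return types.
is_match = isinstance(PlainGreeter(), Greeter)
is_mismatch = isinstance(NotAGreeter(), Greeter)
```

Because the runtime check only looks at member names, a static type checker is still the better tool for catching a wrong signature.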

How components are bound

Every swappable component can be supplied two ways:

By object — pass an instance in code, at construction or via DirectoryPipeline.use(...).
By name — a short registry string (e.g. "docling", "qdrant"), optionally with a :model suffix (e.g. "ollama:qwen2.5"), set in code, on the CLI, or in indx.toml.

Both routes converge on the same typed protocol. Name resolution is covered in Resolution by name below and in Registry and defaults.

The `Stage` protocol

A Stage is a single unit of work in a DirectoryPipeline. The pipeline executes stages in registration order, passing one shared SpaceContext from each stage to the next.

from typing import Protocol, runtime_checkable

@runtime_checkable
class Stage(Protocol):
    """A single unit of work in a DirectoryPipeline.

    A stage MUST return the same SpaceContext instance it received, mutated in
    place. The pipeline executes stages in registration order, passing the
    context from each stage to the next.
    """
    name: str  # stable identifier, e.g. "walk", "parse", "chunk", "relate", "enrich", "embed-pack"

    def run(self, ctx: SpaceContext) -> SpaceContext: ...

Member	Contract
`name`	A stable string identifier (`"walk"`, `"parse"`, `"chunk"`, `"relate"`, `"enrich"`, `"embed-pack"`). Stages are addressed by name for `replace(name, ...)` and `drop(name)`.
`run(ctx)`	Does the stage’s work and returns the same `SpaceContext` instance, mutated in place — never a new object and never `None`.

The built-in stages are not registered in any per-slot component registry; they are wired by DirectoryPipeline directly. Third-party stages are discovered through the indx.stages entry-point group. See custom stages for the full authoring recipe and the pipeline reference for stage ordering.
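The identity rule for `run` can be illustrated with a small sketch. The `SpaceContext` below is a stand-in dataclass (the real class and its fields are not shown on this page), and `StampStage` and `run_in_order` are hypothetical helpers, not indx APIs.

```python
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable


@dataclass
class SpaceContext:
    """Stand-in for indx's SpaceContext; the real fields are assumed, not known."""
    notes: list[str] = field(default_factory=list)


@runtime_checkable
class Stage(Protocol):
    name: str

    def run(self, ctx: SpaceContext) -> SpaceContext: ...


class StampStage:
    """Hypothetical custom stage that records that it ran."""
    name = "stamp"

    def run(self, ctx: SpaceContext) -> SpaceContext:
        ctx.notes.append(f"{self.name}: done")  # mutate in place
        return ctx                               # return the same instance


def run_in_order(stages: list[Stage], ctx: SpaceContext) -> SpaceContext:
    """Toy runner: executes stages in list order and enforces the identity rule."""
    for stage in stages:
        result = stage.run(ctx)
        if result is not ctx:
            raise TypeError(f"stage {stage.name!r} must return the context it received")
    return ctx


ctx = SpaceContext()
out = run_in_order([StampStage()], ctx)
```

A stage that builds a fresh context or returns `None` would fail the `is not ctx` check in this toy runner, which mirrors the contract in the table above.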

The `Parser` protocol

Converts a single file into a ParsedDoc. Called once per file by stage 02 (Parse).

@runtime_checkable
class Parser(Protocol):
    """Converts a single file into a ParsedDoc. Default: docling."""
    def parse(self, file: "FileRef") -> ParsedDoc: ...

Member	Contract
`parse(file)`	Takes a `FileRef` and returns a fully populated `ParsedDoc` — normalized `text`, plus structural `blocks`, `tables`, and `images` that the Chunk stage uses to split with structure intact.

FileRef is the lightweight handle produced by stage 01 (Walk) and passed to the parser. It carries the resolved path, a bytes accessor, the detected MIME/type, and the folder lineage. In practice a parser reads the file through this handle (for example via file.read_text(), file.path, and file.folder) and emits a ParsedDoc whose Source records path, folder, and type.

Default: docling → DoclingParser.
Registry name(s): docling, plus the zero-dependency plaintext fallback that ships in core and any installed adapters (e.g. unstructured, llamaparse, markitdown).

See Choosing a parser for the trade-offs between engines.
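A custom parser might look roughly like the sketch below. `FileRef` and `ParsedDoc` here are simplified stand-ins that model only the attributes mentioned above (the real classes may differ, and `Source` is approximated with a plain dict); `ParagraphParser` is a hypothetical example, not a shipped adapter.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class FileRef:
    """Stand-in for indx's FileRef; only the attributes named above are modeled."""
    path: Path
    folder: str
    type: str = "text/plain"

    def read_text(self) -> str:
        return self.path.read_text(encoding="utf-8")


@dataclass
class ParsedDoc:
    """Stand-in for indx's ParsedDoc; the real class may have more fields."""
    text: str
    source: dict
    blocks: list = field(default_factory=list)
    tables: list = field(default_factory=list)
    images: list = field(default_factory=list)


class ParagraphParser:
    """Hypothetical parser: normalizes whitespace and splits on blank lines."""

    def parse(self, file: FileRef) -> ParsedDoc:
        raw = file.read_text()
        # One block per blank-line-separated paragraph, with inner whitespace collapsed.
        paragraphs = [" ".join(p.split()) for p in raw.split("\n\n") if p.strip()]
        return ParsedDoc(
            text="\n\n".join(paragraphs),
            blocks=paragraphs,
            source={"path": str(file.path), "folder": file.folder, "type": file.type},
        )
```

The parser never inherits from `Parser`; exposing a `parse(file)` method with the right shape is what makes it usable in the slot.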

The `LLM` protocol

Text generation used by stage 05 (Enrich) to derive document type, topics, tags, and summaries.

@runtime_checkable
class LLM(Protocol):
    """Text generation for enrichment (type, topics, tags, summaries).
    Default: openai:gpt-5-mini."""
    def complete(self, prompt: str, *, system: str | None = None,
                 max_tokens: int = 512, temperature: float = 0.0) -> str: ...

Parameter	Default	Meaning
`prompt`	— (required)	The user prompt to complete.
`system`	`None`	Optional system instruction.
`max_tokens`	`512`	Upper bound on generated tokens.
`temperature`	`0.0`	Sampling temperature; the default of `0.0` keeps enrichment deterministic.

complete(...) returns the generated text as a plain str.

Default: openai:gpt-5-mini → OpenAILLM running the gpt-5-mini model; use ollama:qwen2.5 for local, no-egress enrichment.
Registry name(s): ollama:<model>, none (and installed cloud adapters such as openai, anthropic, azure, vllm).

Enrichment can be turned off entirely with --llm none, by dropping the enrich stage, or via [enrich] llm = "none" in indx.toml. See Enrichment with LLM and VLM.
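For tests or offline runs, an object with a matching `complete` method is enough. The sketch below is a hypothetical `CannedLLM` (not part of indx) that returns a fixed reply and treats whitespace-separated words as a rough stand-in for tokens.

```python
from __future__ import annotations


class CannedLLM:
    """Hypothetical offline LLM: returns a fixed reply, cut to max_tokens words."""

    def __init__(self, reply: str = "type: note; topics: none") -> None:
        self.reply = reply
        self.calls: list[dict] = []

    def complete(self, prompt: str, *, system: str | None = None,
                 max_tokens: int = 512, temperature: float = 0.0) -> str:
        # Record the call so a test can inspect what the caller sent.
        self.calls.append({"prompt": prompt, "system": system,
                           "max_tokens": max_tokens, "temperature": temperature})
        # Words stand in for tokens in this sketch; real tokenizers differ.
        return " ".join(self.reply.split()[:max_tokens])


llm = CannedLLM()
summary = llm.complete("Summarize: quarterly report", system="Be brief.", max_tokens=2)
```

Note that `system`, `max_tokens`, and `temperature` are keyword-only in the protocol signature, so passing them positionally raises a `TypeError`.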

The `VLM` protocol

Vision-language enrichment for images and figures, also used by stage 05 (Enrich). Disabled by default.

@runtime_checkable
class VLM(Protocol):
    """Vision-language enrichment for images/layout. Default: none (disabled)."""
    def describe(self, image: bytes, *, prompt: str | None = None) -> str: ...

Parameter	Default	Meaning
`image`	— (required)	The raw image bytes to describe.
`prompt`	`None`	Optional instruction steering the description.

describe(...) returns a textual description as a str.

Default: none → NullVLM, a no-op that ships in core and skips vision enrichment.
Registry name(s): none, plus installed adapters such as qwen-vl, gpt-4o, vlm-local.

Because the default is none, image understanding is strictly opt-in. Enable it with --vlm <adapter> or [enrich] vlm = "<adapter>". See Enrichment with LLM and VLM.
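The `describe` shape can be exercised without a model. The following sketch uses a hypothetical `FormatOnlyVLM` that inspects the leading magic bytes of the image and reports its format and size; it is an illustration of the signature only, not how any indx adapter works.

```python
from __future__ import annotations

# Leading magic bytes for a few common image formats.
_MAGIC = {
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"\xff\xd8\xff": "JPEG",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
}


class FormatOnlyVLM:
    """Hypothetical VLM stand-in: reports format and size instead of calling a model."""

    def describe(self, image: bytes, *, prompt: str | None = None) -> str:
        kind = next((name for magic, name in _MAGIC.items()
                     if image.startswith(magic)), "unknown")
        description = f"{kind} image, {len(image)} bytes"
        # A real adapter would pass the prompt to the model; here it is only echoed.
        return f"{description} (prompt: {prompt})" if prompt else description


caption = FormatOnlyVLM().describe(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8)
```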

The `Embedder` protocol

Turns chunk text into vectors. Driven by stage 06 (Embed+Pack), in batches.

@runtime_checkable
class Embedder(Protocol):
    """Turns text into vectors. Default: openai:text-embedding-3-small."""
    dim: int
    def embed(self, texts: list[str]) -> list[list[float]]: ...

Member	Contract
`dim`	The vector dimensionality (an `int`). Pinned into the archive manifest so a loaded archive can validate query-time compatibility.
`embed(texts)`	Accepts a `list[str]` and returns a `list[list[float]]` — one vector per input, each of length `dim`, in the same order.

Default: openai:text-embedding-3-small → OpenAIEmbedder, dimension 1536; use bge-m3 for the local profile.
Registry name(s): bge-m3 (and installed adapters such as e5, openai, cohere).

The Embed+Pack stage groups chunk texts into batches (default 64) before each embed(...) call. See Choosing an embedder.
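The batching behaviour and the `dim` contract can be sketched as follows. `HashEmbedder` is a hypothetical, deterministic toy whose vectors carry no semantic meaning, and `embed_in_batches` is an assumed approximation of what the stage does, not its actual implementation.

```python
import hashlib


class HashEmbedder:
    """Hypothetical deterministic embedder; vectors carry no semantic meaning."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Map digest bytes (cycled if dim > 32) to floats in [0, 1].
            vectors.append([digest[i % len(digest)] / 255.0 for i in range(self.dim)])
        return vectors


def embed_in_batches(embedder, texts: list[str], batch_size: int = 64) -> list[list[float]]:
    """Calls embed() once per batch and preserves input order."""
    out: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        out.extend(embedder.embed(texts[start:start + batch_size]))
    return out


embedder = HashEmbedder(dim=8)
vectors = embed_in_batches(embedder, [f"chunk {i}" for i in range(130)])
```

With 130 inputs and a batch size of 64, `embed` is called three times (64, 64, and 2 texts), and every returned vector has length `dim`.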

The `Store` protocol

The vector-database adapter. Used by stage 06 to persist vectors and to back space.search(...).

@runtime_checkable
class Store(Protocol):
    """Vector database adapter. Default: qdrant.
    Also: pgvector, chroma, lancedb, jsonl."""
    def upsert(self, ids: list[str], vectors: list[list[float]],
               payloads: list[dict]) -> None: ...
    def query(self, vector: list[float], k: int = 5,
              filter: dict | None = None) -> list[tuple[str, float]]: ...
    def persist(self, dest: str) -> None:
        """Flush/export the store into the output `embeddings/` layout."""

Method	Contract
`upsert(ids, vectors, payloads)`	Inserts or updates vectors. The three lists are positionally aligned: `ids[i]` (chunk id), `vectors[i]` (its embedding), and `payloads[i]` (its metadata `dict`). Returns `None`.
`query(vector, k=5, filter=None)`	Returns the top-`k` matches as a `list[tuple[str, float]]` of `(id, score)`, highest score first. An optional `filter` dict restricts the search (for example by document type).
`persist(dest)`	Flushes/exports the store into the output `embeddings/` layout so the resulting `.indx` archive is portable regardless of backend.

Default: qdrant → QdrantStore.
Registry name(s): qdrant, pgvector, chroma, lancedb, jsonl (the zero-dependency jsonl fallback ships in core and enables a fully offline run).

See Choosing a store and the .indx archive reference.
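A brute-force, in-memory implementation is a reasonable way to see the three methods working together. Everything below is a sketch under assumptions: `MemoryStore` is hypothetical, cosine similarity is one possible scoring choice, the filter is assumed to mean exact match on every key, and the `vectors.jsonl` file name is invented because the real `embeddings/` layout is not specified on this page.

```python
from __future__ import annotations

import json
import math
from pathlib import Path


class MemoryStore:
    """Hypothetical in-memory store using brute-force cosine similarity."""

    def __init__(self) -> None:
        self._rows: dict[str, tuple[list[float], dict]] = {}

    def upsert(self, ids: list[str], vectors: list[list[float]],
               payloads: list[dict]) -> None:
        if not (len(ids) == len(vectors) == len(payloads)):
            raise ValueError("ids, vectors, and payloads must be the same length")
        for id_, vec, payload in zip(ids, vectors, payloads):
            self._rows[id_] = (vec, payload)  # same id overwrites: insert-or-update

    def query(self, vector: list[float], k: int = 5,
              filter: dict | None = None) -> list[tuple[str, float]]:
        def cosine(a: list[float], b: list[float]) -> float:
            norm = math.hypot(*a) * math.hypot(*b)
            return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

        hits = [
            (id_, cosine(vector, vec))
            for id_, (vec, payload) in self._rows.items()
            # Assumed filter semantics: every key must match the payload exactly.
            if not filter or all(payload.get(key) == val for key, val in filter.items())
        ]
        return sorted(hits, key=lambda hit: hit[1], reverse=True)[:k]

    def persist(self, dest: str) -> None:
        # Hypothetical layout: one JSON object per line in vectors.jsonl.
        out_dir = Path(dest)
        out_dir.mkdir(parents=True, exist_ok=True)
        with open(out_dir / "vectors.jsonl", "w", encoding="utf-8") as fh:
            for id_, (vec, payload) in self._rows.items():
                fh.write(json.dumps({"id": id_, "vector": vec, "payload": payload}) + "\n")


store = MemoryStore()
store.upsert(["a", "b"], [[1.0, 0.0], [0.0, 1.0]], [{"type": "memo"}, {"type": "spec"}])
top = store.query([1.0, 0.1], k=1)
```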

The `OutputWriter` protocol

Serializes a finished KnowledgeSpace to disk. The final responsibility of stage 06.

@runtime_checkable
class OutputWriter(Protocol):
    """Serializes a KnowledgeSpace to disk. Default: .indx.
    Also: jsonl, langchain, llamaindex."""
    format: str
    def write(self, space: KnowledgeSpace, out: str) -> None: ...

Member	Contract
`format`	A string naming the output format this writer emits (for example `".indx"`).
`write(space, out)`	Serializes the `KnowledgeSpace` to the output directory `out`. Returns `None`.

Default: .indx → IndxWriter, which seals the portable archive.
Registry name(s): .indx, jsonl, langchain, llamaindex (the jsonl writer ships in core).

See Output formats for what each writer emits.
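A writer only needs a `format` attribute and a `write` method. In the sketch below, `KnowledgeSpace` is a stand-in with a single assumed `chunks` field, and `ChunkJsonlWriter` with its `"chunk-jsonl"` format name and `chunks.jsonl` file name is hypothetical, chosen for illustration rather than taken from indx.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class KnowledgeSpace:
    """Stand-in for indx's KnowledgeSpace; the real class is not shown here."""
    chunks: list[dict] = field(default_factory=list)


class ChunkJsonlWriter:
    """Hypothetical writer that emits one JSON object per chunk."""
    format = "chunk-jsonl"  # illustrative name, not a registered indx format

    def write(self, space: KnowledgeSpace, out: str) -> None:
        out_dir = Path(out)
        out_dir.mkdir(parents=True, exist_ok=True)  # create the output directory
        with open(out_dir / "chunks.jsonl", "w", encoding="utf-8") as fh:
            for chunk in space.chunks:
                fh.write(json.dumps(chunk, ensure_ascii=False) + "\n")
```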

Resolution by name

Each component sub-package maintains a registry mapping a short name to a class. A name may carry an optional :model suffix; the base name selects the adapter class and the suffix selects the model.

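REGISTRY = {"ollama": OllamaLLM, "none": NullLLM}

def resolve(name: str) -> LLM:
    base, _, model = name.partition(":")   # "ollama:qwen2.5" -> base "ollama", model "qwen2.5"
    try:
        cls = REGISTRY[base]               # the base name selects the adapter class
    except KeyError:
        raise StageError(kind="fatal") from None   # an unknown name is a fatal error
    return cls(model=model or None)        # a missing suffix is passed as None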
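To see how the name is split, here is a self-contained sketch with stand-in adapter classes. `split_name` is a hypothetical helper, the two classes only record the model they were given, and `LookupError` stands in for the `StageError` that indx is described as raising.

```python
from __future__ import annotations


class OllamaLLM:
    """Stand-in adapter; only records which model was requested."""
    def __init__(self, model: str | None = None) -> None:
        self.model = model


class NullLLM:
    """Stand-in for the disabled-enrichment adapter."""
    def __init__(self, model: str | None = None) -> None:
        self.model = model


REGISTRY = {"ollama": OllamaLLM, "none": NullLLM}


def split_name(name: str) -> tuple[str, str | None]:
    """Splits 'base:model' on the first colon; model is None when absent."""
    base, _, model = name.partition(":")
    return base, model or None


def resolve(name: str):
    base, model = split_name(name)
    if base not in REGISTRY:
        # indx is described as raising StageError here; LookupError stands in.
        raise LookupError(f"unknown component name: {base!r}")
    return REGISTRY[base](model=model)


local_llm = resolve("ollama:qwen2.5")
disabled = resolve("none")
```

Because `str.partition` splits on the first colon only, a model identifier that itself contains a colon stays intact in the suffix.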

Slots, names, and defaults

Slot	Name strings	Class (default)
`parser`	`docling`	`DoclingParser`
`llm`	`openai:<model>`, `ollama:<model>`, `none`	`OpenAILLM` (`gpt-5-mini`)
`vlm`	`none`, `<adapter>`	`NullVLM`
`embedder`	`openai:text-embedding-3-small`, `bge-m3` (plus adapters `e5`, `cohere`)	`OpenAIEmbedder` (dim 1536)
`store`	`qdrant`, `pgvector`, `chroma`, `lancedb`, `jsonl`	`QdrantStore`
`output`	`.indx`, `jsonl`, `langchain`, `llamaindex`	`IndxWriter`

Resolution order

For each slot, the effective component is chosen in this priority order:

explicit object/string  >  indx.toml  >  documented default

An explicit instance or name passed in code (or to use()) wins; otherwise indx.toml is consulted; otherwise the documented default applies.
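The priority order amounts to a first-match fallback chain. The sketch below assumes a flattened `slot -> name` mapping for the `indx.toml` values and a small `DEFAULTS` table copied from the slot table above; `effective` is a hypothetical helper, not an indx function.

```python
from __future__ import annotations

# Defaults for a few slots, copied from the table above.
DEFAULTS = {"parser": "docling", "store": "qdrant", "vlm": "none"}


def effective(slot: str, explicit: object | None = None,
              toml: dict | None = None) -> object:
    """Returns the first available choice: explicit, then indx.toml, then default."""
    if explicit is not None:
        return explicit
    if toml and slot in toml:
        return toml[slot]
    return DEFAULTS[slot]


toml_cfg = {"store": "pgvector"}  # pretend this was read from indx.toml
choices = {
    "explicit": effective("store", explicit="chroma", toml=toml_cfg),
    "from_toml": effective("store", toml=toml_cfg),
    "default": effective("parser", toml=toml_cfg),
}
```

Here the explicit `"chroma"` beats the configured `"pgvector"`, the configured value beats the default `"qdrant"`, and `parser` falls through to `"docling"` because nothing else is set.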

Third-party adapters register additional names through entry-point groups (indx.parsers, indx.llms, indx.vlms, indx.embedders, indx.stores, indx.outputs, indx.stages); once installed, an adapter is usable by name anywhere a built-in is. First-party built-ins win on name collisions. For the complete registry mechanics see Registry and defaults and Authoring a plugin.
