Bringing Your Own Component

indx is built so that every heavy-lifting component is a swappable adapter behind a typed protocol. When the built-in adapters do not fit, you can supply your own object and hand it to the pipeline — no base class to inherit, no fork of indx, no edits to core/.

This guide covers bring-your-own (BYO) components written in code: a one-off class you instantiate and pass into a DirectoryPipeline. If you want to publish your adapter as an installable package that resolves by name (so others can write store = "weaviate" in indx.toml), see authoring a plugin instead. To add a stage rather than a component, see writing a custom stage.

indx interfaces are typing.Protocol definitions, not abstract base classes. That means indx uses structural typing (“duck typing with type-checking”): any object whose methods match the protocol’s signatures satisfies it. You do not import and subclass Parser — you simply implement parse(...) with the right shape, and your object drops straight in.
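The structural-typing idea can be sketched with the standard library alone. The Parser and ParsedDoc below are local stand-ins declared for illustration, not indx's actual definitions; the point is only that UpperParser never inherits from Parser yet still satisfies it.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class ParsedDoc:
    """Illustrative stand-in for a parsed document (not indx's real model)."""
    text: str


@runtime_checkable
class Parser(Protocol):
    """Illustrative stand-in for a parser protocol."""
    def parse(self, file: str) -> ParsedDoc: ...


class UpperParser:
    """No base class: it merely has a parse() method of the right shape."""
    def parse(self, file: str) -> ParsedDoc:
        return ParsedDoc(text=file.upper())


def run(parser: Parser, file: str) -> ParsedDoc:
    # A static type checker accepts UpperParser here because its methods match.
    return parser.parse(file)


doc = run(UpperParser(), "hello")
satisfies = isinstance(UpperParser(), Parser)  # checks method presence only
```

The runtime isinstance check is possible only because the stand-in protocol is decorated with runtime_checkable; whether indx's own protocols are decorated that way is not something this sketch assumes.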

The six component slots and their protocols (from the technical spec) are:

| Slot | Protocol | Key method(s) | Default |
| --- | --- | --- | --- |
| parser | Parser | parse(file) -> ParsedDoc | docling |
| llm | LLM | complete(prompt, *, system=, max_tokens=, temperature=) -> str | openai:gpt-5-mini |
| vlm | VLM | describe(image, *, prompt=) -> str | none |
| embedder | Embedder | dim: int; embed(texts) -> list[list[float]] | openai:text-embedding-3-small (dim 1536) |
| store | Store | upsert(...), query(...), persist(dest) | qdrant |
| output | OutputWriter | format: str; write(space, out) -> None | .indx |

The Default column is the shipped zero-config cloud stack, which needs OPENAI_API_KEY. The all-local stack (ollama:qwen2.5 + bge-m3, dim 1024) is the opt-in local profile, selected via pip install "indx[local]" or explicit flags/config; see registry and defaults.

See the protocols reference for the exact, complete signatures of every slot.
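As a rough illustration of how small a conforming component can be, here is a stub LLM and a toy embedder shaped after the slot table. Both are hypothetical test doubles: the keyword defaults (system=None, max_tokens=256, temperature=0.0) and the 8-dimension hashing scheme are assumptions made for this sketch, so check the protocols reference for the real signatures.

```python
import hashlib
from typing import Optional


class EchoLLM:
    """Stub LLM: returns a truncated copy of the prompt, no network calls."""

    def complete(self, prompt: str, *, system: Optional[str] = None,
                 max_tokens: int = 256, temperature: float = 0.0) -> str:
        # Defaults are assumed; only the parameter names come from the table.
        words = prompt.split()
        return " ".join(words[:max_tokens])


class HashEmbedder:
    """Toy embedder: deterministic vectors derived from a SHA-256 digest."""

    dim = 8  # must equal the length of every vector returned by embed()

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Map the first `dim` bytes to floats in [0.0, 1.0].
            vectors.append([b / 255 for b in digest[: self.dim]])
        return vectors


llm = EchoLLM()
embedder = HashEmbedder()
reply = llm.complete("summarise this short note", max_tokens=2)
vectors = embedder.embed(["alpha", "beta"])
```

Doubles like these carry no semantic meaning, but they can be handy for exercising a pipeline offline before wiring in a real model.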

A Parser turns one file into a ParsedDoc. Here is a minimal Markdown parser that splits on blank lines into paragraph blocks. Note that it is a plain class — it implements parse and returns a core ParsedDoc, nothing more.

from indx import DirectoryPipeline, ParsedDoc, Source

class MyMarkdownParser:
    """Custom Parser: satisfies the Parser protocol structurally."""

    def parse(self, file) -> ParsedDoc:
        text = file.read_text()
        return ParsedDoc(
            source=Source(path=file.path, folder=file.folder, type="markdown"),
            text=text,
            blocks=[{"kind": "paragraph", "text": p} for p in text.split("\n\n")],
        )

pipeline = DirectoryPipeline(embedder="bge-m3", store="chroma")
pipeline.use(parser=MyMarkdownParser())  # BYO parser instance
space = pipeline.run("./notes", "./out")

The file argument is a FileRef produced by stage 01 (Walk): it carries the resolved path, a bytes/text accessor, the detected MIME type, and the folder lineage. Your job is to read it and return normalized text plus any structural blocks, tables, or images the Chunk stage can use to split with structure intact.

Example 2 — chaining .use(...) and .drop(...)

use() returns the pipeline, so component swaps chain fluently. You can also remove a stage entirely — drop("enrich") skips all LLM/VLM work, which is handy when you have no model available or simply do not want enrichment.

from indx import DirectoryPipeline, ParsedDoc, Source

class MyMarkdownParser:
    def parse(self, file) -> ParsedDoc:
        text = file.read_text()
        return ParsedDoc(
            source=Source(path=file.path, folder=file.folder, type="markdown"),
            text=text,
            blocks=[{"kind": "paragraph", "text": p} for p in text.split("\n\n")],
        )

pipeline = (
    DirectoryPipeline(embedder="bge-m3", store="chroma")
    .use(parser=MyMarkdownParser())  # BYO parser instance
    .drop("enrich")                  # skip LLM enrichment entirely
)
space = pipeline.run("./notes", "./out")

Dropping enrich removes the only stage that would call an LLM or VLM, so documents keep their detected type but get no LLM-derived topics, tags, or summaries. Dropping embed-pack instead produces a graph-only space with no vectors. See the pipeline overview for what each stage contributes.

There are two equivalent ways to bind a component. Both accept either an instance (your BYO object) or a name string (a registered adapter).

# (a) At construction
pipeline = DirectoryPipeline(
    parser=MyMarkdownParser(),  # instance
    embedder="bge-m3",          # name string
    store="chroma",
)

# (b) Via use() after construction (chainable, returns self)
pipeline = DirectoryPipeline(embedder="bge-m3", store="chroma")
pipeline.use(parser=MyMarkdownParser())

Both forms are interchangeable; pick whichever reads better. The keyword names are the same in both: parser=, llm=, vlm=, embedder=, store=, output=.

For any slot you leave unset, indx resolves the effective component using this precedence:

explicit code argument / use() > CLI flag > indx.toml > documented default

So a BYO object passed in code always wins over config or defaults. See the configuration guide and configuration reference for the full resolution rules.
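The precedence rule amounts to taking the first source that provides a value. The helper below is a hypothetical illustration of that logic, not indx's resolver; resolve_slot and its parameter names are invented for this sketch.

```python
from typing import Any, Optional


def resolve_slot(code: Optional[Any] = None, cli: Optional[str] = None,
                 toml: Optional[str] = None, default: Optional[str] = None) -> Any:
    """Return the first value that is set, checked in precedence order."""
    for candidate in (code, cli, toml, default):
        if candidate is not None:
            return candidate
    return None


class MyStore:
    """Placeholder BYO object."""


byo = MyStore()
# A BYO instance passed in code beats every other source.
from_code = resolve_slot(code=byo, cli="chroma", toml="weaviate", default="qdrant")
# With no code argument, the CLI flag wins over indx.toml.
from_cli = resolve_slot(cli="chroma", toml="weaviate", default="qdrant")
# With nothing set explicitly, the documented default applies.
fallback = resolve_slot(default="qdrant")
```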

Whether your component is a throwaway class or a future plugin, it should honour the same five-point contract from the spec. Following it keeps your adapter portable, import-safe, and friendly when a dependency is missing.

Match the method names and signatures in the protocols reference precisely. Structural typing means a near-miss (wrong argument name, wrong return type) will not raise at bind time but will fail mid-run. Run mypy/pyright against your adapter to catch mismatches early.
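To see why a near-miss slips through at bind time, note that a runtime isinstance check against a runtime_checkable protocol only verifies that the method exists, not its signature. The sketch below uses a local stand-in LLM protocol (not imported from indx, with assumed defaults) and a hypothetical signature_matches helper that compares parameter names and kinds; a static checker such as mypy or pyright remains the more thorough option.

```python
import inspect
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLM(Protocol):
    """Local stand-in mirroring the shape listed in the slot table."""
    def complete(self, prompt: str, *, system: str = "", max_tokens: int = 256,
                 temperature: float = 0.0) -> str: ...


class NearMissLLM:
    # Wrong keyword: `max_length` instead of `max_tokens`.
    def complete(self, prompt: str, *, system: str = "", max_length: int = 256,
                 temperature: float = 0.0) -> str:
        return prompt


def signature_matches(obj: object, protocol: type, method: str) -> bool:
    """Compare parameter names and kinds of obj's method with the protocol's."""
    expected = inspect.signature(getattr(protocol, method)).parameters
    actual = inspect.signature(getattr(type(obj), method)).parameters
    return ([(p.name, p.kind) for p in expected.values()]
            == [(p.name, p.kind) for p in actual.values()])


candidate = NearMissLLM()
passes_isinstance = isinstance(candidate, LLM)                    # method exists
passes_signature = signature_matches(candidate, LLM, "complete")  # names differ
```

Here the object passes the isinstance check but fails the signature comparison, which is the kind of mismatch that would otherwise surface only when the pipeline calls complete(..., max_tokens=...).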

2. Convert at the edge — never leak vendor types into core models

A ParsedDoc, Chunk, or Document must never hold a vendor object (a raw provider response, a qdrant_client.PointStruct, a LangChain Document, etc.). Do all conversion to and from core types inside your adapter, at its boundary. Core models stay vendor-free so the resulting .indx archive is portable regardless of which backend produced it.

class MyStore:
    def __init__(self, client) -> None:
        self._client = client  # vendor client, kept private to the adapter

    def upsert(self, ids, vectors, payloads) -> None:
        from my_vendor_sdk import Point  # vendor type stays local
        points = [Point(id=i, vec=v, payload=p)
                  for i, v, p in zip(ids, vectors, payloads)]
        self._client.upsert(points)  # convert here, not in core/

3. Lazy-import heavy deps, with a MissingDependencyError hint

Do not import a heavy or optional dependency at module top level — that would break indx’s light, air-gapped core. Import it inside the method that needs it and raise MissingDependencyError with an actionable pip install hint when it is absent.

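from indx.core.errors import MissingDependencyError

class MyEmbedder:
    dim = 768  # must match the output size of the model loaded below

    def __init__(self) -> None:
        self._model = None  # loaded on the first embed() call

    def embed(self, texts: list[str]) -> list[list[float]]:
        if self._model is None:
            try:
                from sentence_transformers import SentenceTransformer  # lazy
            except ModuleNotFoundError as exc:
                raise MissingDependencyError(
                    "MyEmbedder requires sentence-transformers. "
                    "Install it with: pip install sentence-transformers"
                ) from exc
            self._model = SentenceTransformer("my-model")  # cache across calls
        # tolist() yields plain Python floats rather than NumPy scalars
        return self._model.encode(texts).tolist()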

Importing your adapter module must never fail because a backend is absent or a service is unreachable. Defer all heavy work — SDK imports, network connections, model loading — to construction or method-call time, not module import. This is what lets the registry discover adapters cheaply without paying for backends you are not using.
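One way to keep an adapter import-safe is to hold only configuration in __init__ and open the connection on first use. The sketch below uses functools.cached_property; FakeClient, LazyStore, and the url parameter are hypothetical names standing in for a real vendor SDK.

```python
from functools import cached_property


class FakeClient:
    """Stand-in for a vendor client whose constructor would hit the network."""

    opened = 0  # class-level counter so the sketch can show when connecting happens

    def __init__(self, url: str) -> None:
        FakeClient.opened += 1
        self.url = url
        self.points: dict = {}


class LazyStore:
    def __init__(self, url: str) -> None:
        self._url = url  # store configuration only; no I/O here

    @cached_property
    def _client(self) -> FakeClient:
        # In a real adapter the vendor import would also live here.
        return FakeClient(self._url)

    def upsert(self, ids, vectors, payloads) -> None:
        for i, v, p in zip(ids, vectors, payloads):
            self._client.points[i] = (v, p)


store = LazyStore("http://localhost:6333")
opened_after_init = FakeClient.opened   # nothing has connected yet
store.upsert(["a"], [[0.1, 0.2]], [{"k": "v"}])
store.upsert(["b"], [[0.3, 0.4]], [{"k": "w"}])
opened_after_use = FakeClient.opened    # connected once, then reused
```

Because the client is created inside a cached property, both importing the module and constructing LazyStore stay free of network traffic, and repeated calls reuse the same connection.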

If you write a custom Store, implement persist(dest) to flush your vectors into the embeddings/ layout. That keeps the sealed .indx archive self-contained and loadable even when your backing database is offline. See the embed + pack stage and the .indx archive reference for the on-disk format.
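A minimal sketch of persist(dest) for an in-memory store follows. The on-disk format is defined by the .indx archive reference, so the file name vectors.json and the JSON layout used here are placeholders chosen for illustration only; the upsert signature mirrors the MyStore example earlier in this guide, and returning the written path is a convenience of the sketch rather than part of the protocol.

```python
import json
import tempfile
from pathlib import Path


class InMemoryStore:
    def __init__(self) -> None:
        self._rows: dict = {}

    def upsert(self, ids, vectors, payloads) -> None:
        for i, v, p in zip(ids, vectors, payloads):
            self._rows[i] = {"vector": list(v), "payload": p}

    def persist(self, dest) -> Path:
        # Assumed layout: <dest>/embeddings/vectors.json (placeholder file name).
        out_dir = Path(dest) / "embeddings"
        out_dir.mkdir(parents=True, exist_ok=True)
        out_file = out_dir / "vectors.json"
        out_file.write_text(json.dumps(self._rows, sort_keys=True), encoding="utf-8")
        return out_file


store = InMemoryStore()
store.upsert(["doc-1"], [[0.1, 0.2, 0.3]], [{"path": "notes/a.md"}])
with tempfile.TemporaryDirectory() as tmp:
    written = store.persist(tmp)
    # Reading the file back needs no live database, only the persisted data.
    restored = json.loads(written.read_text(encoding="utf-8"))
```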