Python SDK Reference

The indx SDK is the product; the CLI is a thin view of it. Everything you can do from indx <dir> --out <dir> you can do in Python, and vice versa. This page documents the curated public import surface, the DirectoryPipeline builder, and the first-class KnowledgeSpace accessors.

All public symbols are re-exported from the top-level indx package and the per-slot sub-packages. Internal modules are not part of the API contract — import only from these paths.

from indx import (
    DirectoryPipeline,
    KnowledgeSpace, Document, Chunk, Relation, RelationType,
    SpaceContext, SearchHit, SpaceStats, ParsedDoc, Source,
)
from indx.parsers import Parser
from indx.llm import LLM, VLM
from indx.embed import Embedder
from indx.store import Store
from indx.output import OutputWriter

| Import | Kind | Purpose |
| --- | --- | --- |
| DirectoryPipeline | class | The staged builder that turns a directory into a KnowledgeSpace. |
| KnowledgeSpace | model | Top-level result; load/save/search/stats. |
| Document, Chunk, Relation, Source, ParsedDoc | models | Core data types; see Data Models. |
| RelationType | enum | Typed graph edges: sibling, parent, references, continues, duplicate-of. |
| SpaceContext | model | The shared, mutable context threaded through every stage. |
| SearchHit, SpaceStats | models | Result wrappers returned by search() and stats. |
| Parser, LLM, VLM, Embedder, Store, OutputWriter | protocols | The swappable component interfaces; see Protocols. |

DirectoryPipeline registers the six built-in stages — Walk → Parse → Chunk → Relate → Enrich → Embed+Pack — in canonical order, binds your chosen components, and runs them over a source directory or .zip.

Every component slot accepts an instance, a name string, or None. An unset slot falls back to indx.toml, then to the documented default.

DirectoryPipeline(
    *,
    parser: Parser | str | None = None,        # default: "docling"
    llm: LLM | str | None = None,              # default: "openai:gpt-5-mini"
    vlm: VLM | str | None = None,              # default: "none"
    embedder: Embedder | str | None = None,    # default: "openai:text-embedding-3-small"
    store: Store | str | None = None,          # default: "qdrant"
    output: OutputWriter | str | None = None,  # default: ".indx"
    config: str | IndxConfig | None = None,    # path to indx.toml or a config object
)

| Kwarg | Accepts | Default | Notes |
| --- | --- | --- | --- |
| parser | Parser instance, name, None | docling | The Parse stage engine. |
| llm | LLM instance, name, None | openai:gpt-5-mini | Enrichment text model; pass "none" to disable or "ollama:qwen2.5" for local. |
| vlm | VLM instance, name, None | none | Vision-language enrichment; off by default. |
| embedder | Embedder instance, name, None | openai:text-embedding-3-small | Vectorizer; dim is pinned in the archive. |
| store | Store instance, name, None | qdrant | Vector backend. Also pgvector, chroma, lancedb, jsonl. |
| output | OutputWriter instance, name, None | .indx | Writer. Also jsonl, langchain, llamaindex. |
| config | path string or IndxConfig | ./indx.toml if present | Resolved configuration (see Configuration). |

Name strings may carry an optional :model suffix (e.g. openai:gpt-5-mini or ollama:qwen2.5). Unknown names raise a fatal error before any stage runs. The resolution order for each slot is: explicit code argument / use() → CLI flag → indx.toml → documented default.
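
The name-string form and the precedence chain can be sketched in a few lines of plain Python. This is an illustration only: `split_component` and `resolve_slot` are hypothetical helpers written for this page, not part of the indx API, and the real resolver may behave differently in edge cases.

```python
from __future__ import annotations


def split_component(spec: str) -> tuple[str, str | None]:
    """Split 'name:model' into (name, model); model is None when absent."""
    name, sep, model = spec.partition(":")
    return name, (model if sep else None)


def resolve_slot(explicit=None, cli=None, toml=None, default=None):
    """Return the first value that is set, in the documented precedence order."""
    for candidate in (explicit, cli, toml, default):
        if candidate is not None:
            return candidate
    return None


print(split_component("openai:gpt-5-mini"))  # ('openai', 'gpt-5-mini')
print(split_component("docling"))            # ('docling', None)
print(resolve_slot(cli="chroma", toml="pgvector", default="qdrant"))  # chroma
```

In this sketch a CLI value wins over indx.toml, and the default is used only when nothing else is set, which mirrors the order stated above.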

def use(self, **components) -> DirectoryPipeline

Swap components by keyword after construction: parser=, llm=, vlm=, embedder=, store=, output=. Accepts instances or name strings and returns self for chaining.

pipeline = (
    DirectoryPipeline(embedder="bge-m3")
    .use(store="chroma", llm="none")
)

A fresh pipeline carries the six built-in stages, addressable by name ("walk", "parse", "chunk", "relate", "enrich", "embed-pack"). These methods reshape the stage list; each returns self for chaining.

| Method | Signature | Effect |
| --- | --- | --- |
| stages | stages() -> list[Stage] | The current ordered stage list. |
| insert | insert(index: int, stage: Stage) -> DirectoryPipeline | Insert a custom stage at a 0-based position. |
| append | append(stage: Stage) -> DirectoryPipeline | Append a stage to the end. |
| replace | replace(name: str, stage: Stage) -> DirectoryPipeline | Replace the stage with the given name. |
| drop | drop(name: str) -> DirectoryPipeline | Remove the named stage. |

drop("enrich") is a common operation when no LLM is available or wanted; drop("embed-pack") (or the --no-embed CLI flag) produces a graph-only space with no vectors. A custom stage must satisfy the Stage protocol — a name: str attribute and run(ctx: SpaceContext) -> SpaceContext that returns the same context, mutated. See Custom stages.
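
To make the index arithmetic concrete, the sketch below models the stage list as a plain Python list of names. `StageList` is a hypothetical stand-in written for this page, not the SDK's implementation; it only mirrors the documented effect of insert, replace, and drop, and its error behavior for unknown names is an assumption.

```python
class StageList:
    """Toy model of the pipeline's ordered stage names (illustration only)."""

    def __init__(self):
        self.names = ["walk", "parse", "chunk", "relate", "enrich", "embed-pack"]

    def insert(self, index, name):
        self.names.insert(index, name)  # 0-based, like list.insert
        return self                     # return self so calls can be chained

    def replace(self, old, new):
        self.names[self.names.index(old)] = new
        return self

    def drop(self, name):
        self.names.remove(name)         # this toy raises ValueError for unknown names
        return self


stages = StageList().insert(3, "pii-redact").drop("enrich")
print(stages.names)
# ['walk', 'parse', 'chunk', 'pii-redact', 'relate', 'embed-pack']
```

Position 3 is the slot currently held by "relate", so a stage inserted there runs after chunking and before relation building.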

from indx import SpaceContext

class PiiRedactStage:
    name = "pii-redact"

    def run(self, ctx: SpaceContext) -> SpaceContext:
        for chunk in ctx.chunks:
            # redact() is your own function; it is not provided by indx
            chunk.text = redact(chunk.text)
        return ctx  # MUST return the same context

pipeline.insert(3, PiiRedactStage())  # run after Chunk, before Relate

def run(self, src: str, out: str | None = None) -> KnowledgeSpace

Executes every registered stage over src (a directory or .zip), writes the output layout to out when set, and returns the resulting KnowledgeSpace. When out is omitted the space is returned in memory without sealing an archive to disk.
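
Conceptually, run threads one context object through the stage list. The following is a minimal sketch of that loop under stated assumptions: `Ctx`, `UpperStage`, and `run_stages` are hypothetical stand-ins, and the identity check is one way a runner could enforce the "same context" rule, not necessarily what indx does.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Ctx:
    """Stand-in for SpaceContext: a mutable object shared by all stages."""
    chunks: list = field(default_factory=list)


class Stage(Protocol):
    """Structural interface: any object with a name and a run() method fits."""
    name: str

    def run(self, ctx: Ctx) -> Ctx: ...


class UpperStage:
    """Example stage: mutates the context and returns the same object."""
    name = "upper"

    def run(self, ctx: Ctx) -> Ctx:
        ctx.chunks = [c.upper() for c in ctx.chunks]
        return ctx


def run_stages(stages: list[Stage], ctx: Ctx) -> Ctx:
    for stage in stages:
        result = stage.run(ctx)
        if result is not ctx:  # assumed enforcement of the contract
            raise RuntimeError(f"stage {stage.name!r} returned a new context")
    return ctx


ctx = run_stages([UpperStage()], Ctx(chunks=["retention policy"]))
print(ctx.chunks)  # ['RETENTION POLICY']
```

Note that `UpperStage` does not inherit from `Stage`; matching the attribute and method shape is enough for a structural protocol.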

KnowledgeSpace is the top-level result: the document graph, chunks, embeddings, and metadata. It serializes to a single portable .indx archive and exposes first-class accessors for stats, document filtering, and semantic search. (Full field documentation lives in Data Models.)

@property
def stats(self) -> SpaceStats

Aggregate statistics for the space: documents, chunks, relations, embeddings, embed_dim, a types histogram (document count per detected type), and bytes_source.

s = space.stats
print(s.documents, s.chunks, s.embed_dim)  # e.g. 128 1042 1536
print(s.types)                             # {'policy': 40, 'guide': 30, ...}

def documents(self, type: str | None = None) -> list[Document]

Returns the space’s documents, optionally filtered to a single detected type.

for doc in space.documents(type="policy"):
    print(doc.path, doc.topics, doc.summary)

def search(self, query: str, k: int = 5) -> list[SearchHit]

Embeds query with the space’s embedder and returns the top-k chunks by similarity. Each result is a SearchHit exposing .chunk, a .score (higher is better), resolved .neighbors (the adjacent chunks, for context windows), and a .source property that proxies chunk.source.
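
For intuition, top-k retrieval with neighbor lookup can be sketched in plain Python. The cosine scoring, the toy two-dimensional vectors, and the `top_k` helper are assumptions made for illustration; this page does not specify which similarity metric indx uses or how the configured store executes the query.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=5):
    """Return (score, index, neighbor_indices) for the k best chunks."""
    scored = sorted(
        ((cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)),
        reverse=True,  # higher score is better
    )
    hits = []
    for score, i in scored[:k]:
        # Neighbors: the chunks immediately before and after, when they exist.
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(chunk_vecs)]
        hits.append((score, i, neighbors))
    return hits


vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.0], vecs, k=2)
print([(round(s, 3), i, n) for s, i, n in hits])
# [(1.0, 0, [1]), (0.707, 2, [1])]
```
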

for hit in space.search("how long is data retained?", k=3):
    print(f"{hit.score:.3f} {hit.source.path}")
    print(hit.chunk.text)
    print("context:", [c.id for c in hit.neighbors])

@classmethod
def load(cls, archive: str) -> KnowledgeSpace

def save(self, archive: str) -> None

load opens a sealed .indx archive: it checks the archive’s major version is compatible, validates checksums, and reconstructs the in-memory models (vectors are memory-mapped on demand). save seals the current space into a portable .indx archive. An incompatible or corrupt archive raises a fatal load error.
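
The two checks that load performs can be sketched as follows. The manifest shape (a `version` string plus a `sha256` digest), the supported major version, and the helpers `check_major` and `verify_payload` are all invented for this example; the real .indx layout is not described on this page.

```python
import hashlib

SUPPORTED_MAJOR = 1  # hypothetical: the major version this reader understands


def check_major(version: str) -> None:
    """Reject archives whose major version differs from the reader's."""
    major = int(version.split(".")[0])
    if major != SUPPORTED_MAJOR:
        raise ValueError(f"incompatible archive version {version}")


def verify_payload(data: bytes, expected_sha256: str) -> None:
    """Reject payloads whose SHA-256 digest does not match the manifest."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256:
        raise ValueError("checksum mismatch: archive is corrupt")


payload = b"chunk data"
manifest = {"version": "1.4.0", "sha256": hashlib.sha256(payload).hexdigest()}

check_major(manifest["version"])             # passes: major version matches
verify_payload(payload, manifest["sha256"])  # passes: digest matches

try:
    verify_payload(b"tampered", manifest["sha256"])
except ValueError as err:
    print(err)  # checksum mismatch: archive is corrupt
```
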

space = KnowledgeSpace.load("./ai-ready/handbook.indx")
hits = space.search("gdpr compliance", k=5)
space.save("./backup/handbook.indx")

A complete build-and-query run, with every component named explicitly:

from indx import DirectoryPipeline

pipeline = DirectoryPipeline(
    parser="docling",
    llm="openai:gpt-5-mini",
    embedder="openai:text-embedding-3-small",
    store="qdrant",
)
space = pipeline.run("./docs", "./ai-ready")
print(space.stats.documents, space.stats.chunks)

for doc in space.documents(type="policy"):
    print(doc.path, doc.topics)

for hit in space.search("how long is data retained?", k=3):
    print(hit.score, hit.source.path)
    print(hit.chunk.text)
    print("context:", [c.id for c in hit.neighbors])

Any slot can be a custom object that satisfies its protocol — passed at construction or via use(). Structural typing means you do not subclass anything.

from indx import DirectoryPipeline, ParsedDoc, Source
from indx.parsers import Parser

class MyMarkdownParser:
    """Custom Parser: satisfies the Parser protocol structurally."""

    def parse(self, file) -> ParsedDoc:
        text = file.read_text()
        return ParsedDoc(
            source=Source(path=file.path, folder=file.folder, type="markdown"),
            text=text,
            blocks=[{"kind": "paragraph", "text": p} for p in text.split("\n\n")],
        )

pipeline = (
    DirectoryPipeline(embedder="bge-m3", store="chroma")
    .use(parser=MyMarkdownParser())  # BYO parser instance
    .drop("enrich")                  # skip LLM enrichment entirely
)
space = pipeline.run("./notes", "./out")

See Custom components and Bring your own stack for the full pattern.

Loading a sealed archive and querying it, with no pipeline involved:

from indx import KnowledgeSpace

space = KnowledgeSpace.load("./ai-ready/handbook.indx")
print("docs:", space.stats.documents, "dim:", space.stats.embed_dim)

for hit in space.search("incident response runbook", k=5):
    print(f"[{hit.score:.3f}] {hit.source.path} ({hit.source.type})")
    print(" ", hit.chunk.text[:120], "")
    print(" neighbors:", [c.id for c in hit.neighbors])

  • CLI Reference — the matching command-line surface.
  • Data Models — full field reference for every model returned here.
  • Protocols — the Parser, LLM, VLM, Embedder, Store, OutputWriter, and Stage interfaces.
  • Configuration — what config= resolves and the precedence rules.