Python SDK Reference
The indx SDK is the product; the CLI is a thin view of it. Everything you can do from indx <dir> --out <dir> you can do in Python, and vice versa. This page documents the curated public import surface, the DirectoryPipeline builder, and the first-class KnowledgeSpace accessors.
The public import surface
All public symbols are re-exported from the top-level indx package and the per-slot sub-packages. Internal modules are not part of the API contract; import only from these paths.

    from indx import (
        DirectoryPipeline,
        KnowledgeSpace,
        Document,
        Chunk,
        Relation,
        RelationType,
        SpaceContext,
        SearchHit,
        SpaceStats,
        ParsedDoc,
        Source,
    )
    from indx.parsers import Parser
    from indx.llm import LLM, VLM
    from indx.embed import Embedder
    from indx.store import Store
    from indx.output import OutputWriter

| Import | Kind | Purpose |
|---|---|---|
| DirectoryPipeline | class | The staged builder that turns a directory into a KnowledgeSpace. |
| KnowledgeSpace | model | Top-level result; load/save/search/stats. |
| Document, Chunk, Relation, Source, ParsedDoc | models | Core data types; see Data Models. |
| RelationType | enum | Typed graph edges: sibling, parent, references, continues, duplicate-of. |
| SpaceContext | model | The shared, mutable context threaded through every stage. |
| SearchHit, SpaceStats | models | Result wrappers returned by search() and stats. |
| Parser, LLM, VLM, Embedder, Store, OutputWriter | protocols | The swappable component interfaces; see Protocols. |
DirectoryPipeline
DirectoryPipeline registers the six built-in stages (Walk → Parse → Chunk → Relate → Enrich → Embed+Pack) in canonical order, binds your chosen components, and runs them over a source directory or .zip.
Constructor
Every component slot accepts an instance, a name string, or None. An unset slot falls back to indx.toml, then to the documented default.

    DirectoryPipeline(
        *,
        parser: Parser | str | None = None,        # default: "docling"
        llm: LLM | str | None = None,              # default: "openai:gpt-5-mini"
        vlm: VLM | str | None = None,              # default: "none"
        embedder: Embedder | str | None = None,    # default: "openai:text-embedding-3-small"
        store: Store | str | None = None,          # default: "qdrant"
        output: OutputWriter | str | None = None,  # default: ".indx"
        config: str | IndxConfig | None = None,    # path to indx.toml or a config object
    )

| Kwarg | Accepts | Default | Notes |
|---|---|---|---|
| parser | Parser instance, name, None | docling | The Parse stage engine. |
| llm | LLM instance, name, None | openai:gpt-5-mini | Enrichment text model; pass "none" to disable or "ollama:qwen2.5" for local. |
| vlm | VLM instance, name, None | none | Vision-language enrichment; off by default. |
| embedder | Embedder instance, name, None | openai:text-embedding-3-small | Vectorizer; dim is pinned in the archive. |
| store | Store instance, name, None | qdrant | Vector backend. Also pgvector, chroma, lancedb, jsonl. |
| output | OutputWriter instance, name, None | .indx | Writer. Also jsonl, langchain, llamaindex. |
| config | path string or IndxConfig | ./indx.toml if present | Resolved configuration (see Configuration). |
Name strings may carry an optional :model suffix (e.g. openai:gpt-5-mini or ollama:qwen2.5). Unknown names raise a fatal error before any stage runs. The resolution order for each slot is: explicit code argument / use() → CLI flag → indx.toml → documented default.
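The precedence rule above amounts to taking the first source that supplies a value. The following is a minimal sketch of that idea only, not the SDK's resolver: `resolve_slot` and `split_name` are hypothetical helpers written for this illustration, and the sketch assumes an unset source is represented as None.

```python
from __future__ import annotations


def split_name(name: str) -> tuple[str, str | None]:
    """Split "provider:model" into (provider, model); model is None if absent."""
    provider, sep, model = name.partition(":")
    return provider, (model if sep else None)


def resolve_slot(
    explicit: str | None,
    cli_flag: str | None,
    toml_value: str | None,
    default: str,
) -> str:
    """Return the first value that is set, in the documented precedence order."""
    for candidate in (explicit, cli_flag, toml_value):
        if candidate is not None:
            return candidate
    return default


# No code argument and no CLI flag: the indx.toml value beats the default.
print(resolve_slot(None, None, "ollama:qwen2.5", "openai:gpt-5-mini"))
# An explicit code argument beats every other source.
print(resolve_slot("none", "openai:gpt-5-mini", "ollama:qwen2.5", "openai:gpt-5-mini"))
print(split_name("openai:gpt-5-mini"))
print(split_name("docling"))
```

Splitting on the first colon only keeps any further colons inside the model part, which is a choice made for this sketch rather than documented behavior.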
Component binding

    def use(self, **components) -> DirectoryPipeline

Swap components by keyword after construction: parser=, llm=, vlm=, embedder=, store=, output=. Accepts instances or name strings and returns self for chaining.

    pipeline = (
        DirectoryPipeline(embedder="bge-m3")
        .use(store="chroma", llm="none")
    )

Stage management
A fresh pipeline carries the six built-in stages, addressable by name ("walk", "parse", "chunk", "relate", "enrich", "embed-pack"). stages() returns the current ordered list; the other methods reshape that list and return self for chaining.

| Method | Signature | Effect |
|---|---|---|
| stages | stages() -> list[Stage] | The current ordered stage list. |
| insert | insert(index: int, stage: Stage) -> DirectoryPipeline | Insert a custom stage at a 0-based position. |
| append | append(stage: Stage) -> DirectoryPipeline | Append a stage to the end. |
| replace | replace(name: str, stage: Stage) -> DirectoryPipeline | Replace the stage with the given name. |
| drop | drop(name: str) -> DirectoryPipeline | Remove the named stage. |
drop("enrich") is a common operation when no LLM is available or wanted; drop("embed-pack") (or the --no-embed CLI flag) produces a graph-only space with no vectors. A custom stage must satisfy the Stage protocol — a name: str attribute and run(ctx: SpaceContext) -> SpaceContext that returns the same context, mutated. See Custom stages.
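To make the 0-based positions and the "same context" rule concrete, here is a self-contained sketch that does not import indx. `FakeContext`, `NamedStage`, and `run_stages` are hypothetical stand-ins, the stage list is a plain Python list, and the identity check is this sketch's own guard; how indx itself reacts to a stage that returns a different object is not documented here.

```python
from dataclasses import dataclass, field


@dataclass
class FakeContext:
    """Hypothetical stand-in for SpaceContext; records which stages ran."""
    log: list = field(default_factory=list)


class NamedStage:
    """Minimal stage: a name attribute plus run(ctx) returning the same ctx."""

    def __init__(self, name: str) -> None:
        self.name = name

    def run(self, ctx: FakeContext) -> FakeContext:
        ctx.log.append(self.name)  # mutate in place
        return ctx                 # hand back the same object


def run_stages(stages: list, ctx: FakeContext) -> FakeContext:
    """Thread one shared context through every stage, in order."""
    for stage in stages:
        returned = stage.run(ctx)
        if returned is not ctx:  # sketch-only guard, not documented indx behavior
            raise RuntimeError(f"stage {stage.name!r} returned a different context")
    return ctx


builtin = ["walk", "parse", "chunk", "relate", "enrich", "embed-pack"]
stages = [NamedStage(name) for name in builtin]
stages.insert(3, NamedStage("pii-redact"))          # 0-based: after chunk, before relate
stages = [s for s in stages if s.name != "enrich"]  # list-level analogue of drop("enrich")

final = run_stages(stages, FakeContext())
print(final.log)
```

With six built-in stages, index 3 is the slot currently held by "relate", so the inserted stage runs immediately before it.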

    from indx import SpaceContext

    class PiiRedactStage:
        name = "pii-redact"

        def run(self, ctx: SpaceContext) -> SpaceContext:
            for chunk in ctx.chunks:
                # redact() stands for your own redaction function; it is not defined here
                chunk.text = redact(chunk.text)
            return ctx  # MUST return the same context

    # Index 3 places the stage after Chunk and before Relate.
    pipeline.insert(3, PiiRedactStage())

Execution

    def run(self, src: str, out: str | None = None) -> KnowledgeSpace

Executes every registered stage over src (a directory or .zip), writes the output layout to out when set, and returns the resulting KnowledgeSpace. When out is omitted, the space is returned in memory without sealing an archive to disk.
KnowledgeSpace
KnowledgeSpace is the top-level result: the document graph, chunks, embeddings, and metadata. It serializes to a single portable .indx archive and exposes first-class accessors for stats, document filtering, and semantic search. (Full field documentation lives in Data Models.)
stats

    @property
    def stats(self) -> SpaceStats

Aggregate statistics for the space: documents, chunks, relations, embeddings, embed_dim, a types histogram (document count per detected type), and bytes_source.

    s = space.stats
    print(s.documents, s.chunks, s.embed_dim)  # e.g. 128 1042 1536
    print(s.types)                             # {'policy': 40, 'guide': 30, ...}

documents(type=None)

    def documents(self, type: str | None = None) -> list[Document]

Returns the space’s documents, optionally filtered to a single detected type.

    for doc in space.documents(type="policy"):
        print(doc.path, doc.topics, doc.summary)

search(query, k=5)

    def search(self, query: str, k: int = 5) -> list[SearchHit]

Embeds query with the space’s embedder and returns the top-k chunks by similarity. Each result is a SearchHit exposing .chunk, a .score (higher is better), resolved .neighbors (the adjacent chunks, for context windows), and a .source property that proxies chunk.source.
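
Conceptually, a top-k search with neighbor resolution can be pictured as in the sketch below. It is not the SDK's implementation: the toy 2-dimensional vectors, the cosine scoring, the linear scan, and the `top_k` helper are all assumptions chosen for illustration, and a real vector store would normally use an index instead of scanning every chunk.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity; higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list, chunk_vecs: list, k: int = 5) -> list:
    """Return up to k (score, index, neighbor_indices) tuples, best first."""
    scored = sorted(
        ((cosine(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    hits = []
    for score, i in scored[:k]:
        # Neighbors here are simply the adjacent chunks in document order.
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(chunk_vecs)]
        hits.append((score, i, neighbors))
    return hits


chunk_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]  # toy embeddings, one per chunk
hits = top_k([1.0, 0.0], chunk_vecs, k=2)
for score, index, neighbors in hits:
    print(f"{score:.3f} chunk={index} neighbors={neighbors}")
```

The neighbor list is what lets a caller widen a hit into a context window without running a second search.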

    for hit in space.search("how long is data retained?", k=3):
        print(f"{hit.score:.3f} {hit.source.path}")
        print(hit.chunk.text)
        print("context:", [c.id for c in hit.neighbors])

load(archive) / save(archive)

    @classmethod
    def load(cls, archive: str) -> KnowledgeSpace

    def save(self, archive: str) -> None

load opens a sealed .indx archive: it checks that the archive’s major version is compatible, validates checksums, and reconstructs the in-memory models (vectors are memory-mapped on demand). save seals the current space into a portable .indx archive. An incompatible or corrupt archive raises a fatal load error.
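
The two checks that load performs can be illustrated in isolation. This is a sketch under stated assumptions: this page does not specify the version scheme or the hash algorithm, so the dotted version strings, the use of SHA-256, and both helper names are hypothetical.

```python
import hashlib


def major_compatible(archive_version: str, reader_version: str) -> bool:
    """True when both dotted version strings share the same major number."""
    return archive_version.split(".")[0] == reader_version.split(".")[0]


def checksum_ok(payload: bytes, expected_hex: str) -> bool:
    """Compare a payload's SHA-256 digest with the recorded hex digest."""
    return hashlib.sha256(payload).hexdigest() == expected_hex


payload = b"chunk data"
recorded = hashlib.sha256(payload).hexdigest()

print(major_compatible("1.4.0", "1.0.2"))     # same major version
print(major_compatible("2.0.0", "1.9.9"))     # different major version
print(checksum_ok(payload, recorded))         # untouched payload
print(checksum_ok(payload + b"!", recorded))  # corrupted payload
```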

    space = KnowledgeSpace.load("./ai-ready/handbook.indx")
    hits = space.search("gdpr compliance", k=5)
    space.save("./backup/handbook.indx")

End-to-end examples
Basic run

    from indx import DirectoryPipeline

    pipeline = DirectoryPipeline(
        parser="docling",
        llm="openai:gpt-5-mini",
        embedder="openai:text-embedding-3-small",
        store="qdrant",
    )
    space = pipeline.run("./docs", "./ai-ready")

    print(space.stats.documents, space.stats.chunks)
    for doc in space.documents(type="policy"):
        print(doc.path, doc.topics)

    for hit in space.search("how long is data retained?", k=3):
        print(hit.score, hit.source.path)
        print(hit.chunk.text)
        print("context:", [c.id for c in hit.neighbors])

Bring your own component
Any slot can be a custom object that satisfies its protocol, passed at construction or via use(). Structural typing means you do not subclass anything.
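
Structural typing can be demonstrated with the standard library alone. In the sketch below, `EmbedderLike` is a hypothetical protocol invented for this illustration (the real Embedder protocol may declare different methods), and `LengthEmbedder` is a toy class that matches it without inheriting from anything.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class EmbedderLike(Protocol):
    """Hypothetical stand-in; the real Embedder protocol may differ."""

    def embed(self, texts: list) -> list: ...


class LengthEmbedder:
    """Toy embedder: no base class, just a matching embed() method."""

    def embed(self, texts: list) -> list:
        # Two crude features per text: character count and word count.
        return [[float(len(t)), float(len(t.split()))] for t in texts]


class NotAnEmbedder:
    """Has a differently named method, so it does not match the protocol."""

    def encode(self, texts: list) -> list:
        return texts


vectors = LengthEmbedder().embed(["hello world", "indx"])
print(isinstance(LengthEmbedder(), EmbedderLike))
print(isinstance(NotAnEmbedder(), EmbedderLike))
print(vectors)
```

A runtime isinstance check against a protocol only confirms that the named members exist; signatures and return types are left to a static type checker.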

    from indx import DirectoryPipeline, ParsedDoc, Source
    from indx.parsers import Parser

    class MyMarkdownParser:
        """Custom Parser: satisfies the Parser protocol structurally."""

        def parse(self, file) -> ParsedDoc:
            text = file.read_text()
            return ParsedDoc(
                source=Source(path=file.path, folder=file.folder, type="markdown"),
                text=text,
                blocks=[{"kind": "paragraph", "text": p} for p in text.split("\n\n")],
            )

    pipeline = (
        DirectoryPipeline(embedder="bge-m3", store="chroma")
        .use(parser=MyMarkdownParser())  # BYO parser instance
        .drop("enrich")                  # skip LLM enrichment entirely
    )
    space = pipeline.run("./notes", "./out")

See Custom components and Bring your own stack for the full pattern.
Load and query an existing archive

    from indx import KnowledgeSpace

    space = KnowledgeSpace.load("./ai-ready/handbook.indx")

    print("docs:", space.stats.documents, "dim:", space.stats.embed_dim)

    for hit in space.search("incident response runbook", k=5):
        print(f"[{hit.score:.3f}] {hit.source.path} ({hit.source.type})")
        print("  ", hit.chunk.text[:120], "…")
        print("  neighbors:", [c.id for c in hit.neighbors])

See also
- CLI Reference: the matching command-line surface.
- Data Models: full field reference for every model returned here.
- Protocols: the Parser, LLM, VLM, Embedder, Store, OutputWriter, and Stage interfaces.
- Configuration: what config= resolves and the precedence rules.