Python SDK Reference

The indx SDK is the product; the CLI is a thin view of it. Everything you can do from indx <dir> --out <dir> you can do in Python, and vice versa. This page documents the curated public import surface, the DirectoryPipeline builder, and the first-class KnowledgeSpace accessors.

The public import surface

All public symbols are re-exported from the top-level indx package and the per-slot sub-packages. Internal modules are not part of the API contract — import only from these paths.

from indx import (
    DirectoryPipeline,
    KnowledgeSpace, Document, Chunk, Relation, RelationType,
    SpaceContext, SearchHit, SpaceStats, ParsedDoc, Source,
)
from indx.parsers import Parser
from indx.llm     import LLM, VLM
from indx.embed   import Embedder
from indx.store   import Store
from indx.output  import OutputWriter

Import	Kind	Purpose
`DirectoryPipeline`	class	The staged builder that turns a directory into a `KnowledgeSpace`.
`KnowledgeSpace`	model	Top-level result; load/save/search/stats.
`Document`, `Chunk`, `Relation`, `Source`, `ParsedDoc`	models	Core data types — see Data Models.
`RelationType`	enum	Typed graph edges: `sibling`, `parent`, `references`, `continues`, `duplicate-of`.
`SpaceContext`	model	The shared, mutable context threaded through every stage.
`SearchHit`, `SpaceStats`	models	Result wrappers returned by `search()` and `stats`.
`Parser`, `LLM`, `VLM`, `Embedder`, `Store`, `OutputWriter`	protocols	The swappable component interfaces — see Protocols.

`DirectoryPipeline`

DirectoryPipeline registers the six built-in stages — Walk → Parse → Chunk → Relate → Enrich → Embed+Pack — in canonical order, binds your chosen components, and runs them over a source directory or .zip.

Constructor

Every component slot accepts an instance, a name string, or None. An unset slot falls back to indx.toml, then to the documented default.

DirectoryPipeline(
    *,
    parser:   Parser       | str | None = None,   # default: "docling"
    llm:      LLM          | str | None = None,   # default: "openai:gpt-5-mini"
    vlm:      VLM          | str | None = None,   # default: "none"
    embedder: Embedder     | str | None = None,   # default: "openai:text-embedding-3-small"
    store:    Store        | str | None = None,   # default: "qdrant"
    output:   OutputWriter | str | None = None,   # default: ".indx"
    config:   str | IndxConfig    | None = None,   # path to indx.toml or a config object
)

Kwarg	Accepts	Default	Notes
`parser`	`Parser` instance, name, `None`	`docling`	The Parse stage engine.
`llm`	`LLM` instance, name, `None`	`openai:gpt-5-mini`	Enrichment text model; pass `"none"` to disable or `"ollama:qwen2.5"` for local.
`vlm`	`VLM` instance, name, `None`	`none`	Vision-language enrichment; off by default.
`embedder`	`Embedder` instance, name, `None`	`openai:text-embedding-3-small`	Vectorizer; `dim` is pinned in the archive.
`store`	`Store` instance, name, `None`	`qdrant`	Vector backend. Also `pgvector`, `chroma`, `lancedb`, `jsonl`.
`output`	`OutputWriter` instance, name, `None`	`.indx`	Writer. Also `jsonl`, `langchain`, `llamaindex`.
`config`	path string, `IndxConfig`, `None`	`./indx.toml` if present	Resolved configuration (see Configuration).

Name strings may carry an optional :model suffix (e.g. openai:gpt-5-mini or ollama:qwen2.5). Unknown names raise a fatal error before any stage runs. The resolution order for each slot is: explicit code argument / use() → CLI flag → indx.toml → documented default.
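
The resolution order above amounts to a first-value-that-is-set lookup. The following is a minimal sketch of that idea, not indx's implementation: `resolve_slot`, `split_name`, and their arguments are hypothetical names introduced for illustration, and only the precedence order, the defaults, and the `name:model` string shape come from this page.

```python
# Documented defaults, copied from the constructor table on this page.
DEFAULTS = {
    "parser": "docling",
    "llm": "openai:gpt-5-mini",
    "vlm": "none",
    "embedder": "openai:text-embedding-3-small",
    "store": "qdrant",
    "output": ".indx",
}


def resolve_slot(slot, explicit=None, cli=None, toml=None):
    """Hypothetical helper: return the first value that is set, in the
    documented order (code argument / use() -> CLI flag -> indx.toml -> default)."""
    for candidate in (explicit, cli, toml):
        if candidate is not None:
            return candidate
    return DEFAULTS[slot]


def split_name(name):
    """Hypothetical helper: split 'name:model' into (name, model).
    The model part is None when no ':' suffix is present."""
    provider, sep, model = name.partition(":")
    return provider, (model if sep else None)


print(resolve_slot("store", toml="chroma"))                    # chroma
print(resolve_slot("store", explicit="jsonl", toml="chroma"))  # jsonl
print(resolve_slot("llm"))                                     # openai:gpt-5-mini
print(split_name("ollama:qwen2.5"))                            # ('ollama', 'qwen2.5')
print(split_name("docling"))                                   # ('docling', None)
```

An explicit argument wins even when `indx.toml` also sets the slot, which is why the second call prints `jsonl`.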

Component binding

def use(self, **components) -> DirectoryPipeline

Swap components by keyword after construction: parser=, llm=, vlm=, embedder=, store=, output=. Accepts instances or name strings and returns self for chaining.

from indx import DirectoryPipeline

pipeline = (
    DirectoryPipeline(embedder="bge-m3")
    .use(store="chroma", llm="none")     # returns self, so calls chain
)

Stage management

A fresh pipeline carries the six built-in stages, addressable by name ("walk", "parse", "chunk", "relate", "enrich", "embed-pack"). The methods below inspect and reshape the stage list: stages() returns the current list, and every mutating method returns self for chaining.

Method	Signature	Effect
`stages`	`stages() -> list[Stage]`	The current ordered stage list.
`insert`	`insert(index: int, stage: Stage) -> DirectoryPipeline`	Insert a custom stage at a 0-based position.
`append`	`append(stage: Stage) -> DirectoryPipeline`	Append a stage to the end.
`replace`	`replace(name: str, stage: Stage) -> DirectoryPipeline`	Replace the stage with the given `name`.
`drop`	`drop(name: str) -> DirectoryPipeline`	Remove the named stage.
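
To make the index arithmetic concrete, the sketch below models the stage list as a plain Python list of names. It is an illustration only: the real pipeline holds `Stage` objects, the custom stage names are made up, and the list operations are an assumption about the observable behaviour rather than the library's code.

```python
# The six built-in stage names, in canonical order (from this page).
stages = ["walk", "parse", "chunk", "relate", "enrich", "embed-pack"]

# insert(3, ...) places the new stage at 0-based index 3:
# after "chunk" (index 2) and before "relate" (previously index 3).
stages.insert(3, "my-custom-stage")

# drop("enrich") removes a stage by name.
stages.remove("enrich")

# replace("parse", ...) swaps a stage in place, keeping its position.
stages[stages.index("parse")] = "my-parse"

print(stages)
# ['walk', 'my-parse', 'chunk', 'my-custom-stage', 'relate', 'embed-pack']
```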

drop("enrich") is a common operation when no LLM is available or wanted; drop("embed-pack") (or the --no-embed CLI flag) produces a graph-only space with no vectors. A custom stage must satisfy the Stage protocol — a name: str attribute and run(ctx: SpaceContext) -> SpaceContext that returns the same context, mutated. See Custom stages.

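import re

from indx import SpaceContext

def redact(text: str) -> str:
    """Placeholder redactor (hypothetical helper): masks email addresses only."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[REDACTED]", text)

class PiiRedactStage:
    name = "pii-redact"

    def run(self, ctx: SpaceContext) -> SpaceContext:
        for chunk in ctx.chunks:
            chunk.text = redact(chunk.text)
        return ctx                       # MUST return the same context

pipeline.insert(3, PiiRedactStage())     # index 3: after "chunk", before "relate"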
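
The Stage contract described above (a `name` attribute plus a `run` method that returns the same context) can be written down as a `typing.Protocol`. This is a sketch under assumptions: `StageLike`, `FakeContext`, and `UppercaseStage` are hypothetical stand-ins so the snippet runs without indx installed, and indx's own protocol definition may differ.

```python
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable


@dataclass
class FakeContext:
    """Hypothetical stand-in for SpaceContext; holds chunk texts only."""
    chunks: list = field(default_factory=list)


@runtime_checkable
class StageLike(Protocol):
    """Hypothetical rendering of the Stage contract: a name plus run()."""
    name: str

    def run(self, ctx: FakeContext) -> FakeContext: ...


class UppercaseStage:
    """Satisfies StageLike structurally; no base class is required."""
    name = "uppercase"

    def run(self, ctx: FakeContext) -> FakeContext:
        ctx.chunks = [text.upper() for text in ctx.chunks]
        return ctx  # same object, mutated in place


ctx = FakeContext(chunks=["alpha", "beta"])
stage = UppercaseStage()
result = stage.run(ctx)

print(isinstance(stage, StageLike))  # True
print(result is ctx)                 # True
print(result.chunks)                 # ['ALPHA', 'BETA']
```

Note that `isinstance` against a runtime-checkable protocol only verifies that the members exist; it does not check signatures or that `run` really returns the same context.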

Execution

def run(self, src: str, out: str | None = None) -> KnowledgeSpace

Executes every registered stage over src (a directory or .zip), writes the output layout to out when set, and returns the resulting KnowledgeSpace. When out is omitted the space is returned in memory without sealing an archive to disk.

`KnowledgeSpace`

KnowledgeSpace is the top-level result: the document graph, chunks, embeddings, and metadata. It serializes to a single portable .indx archive and exposes first-class accessors for stats, document filtering, and semantic search. (Full field documentation lives in Data Models.)

`stats`

@property
def stats(self) -> SpaceStats

Aggregate statistics for the space: documents, chunks, relations, embeddings, embed_dim, a types histogram (document count per detected type), and bytes_source.

s = space.stats
print(s.documents, s.chunks, s.embed_dim)   # e.g. 128 1042 1536
print(s.types)                               # {'policy': 40, 'guide': 30, ...}

`documents(type=None)`

def documents(self, type: str | None = None) -> list[Document]

Returns the space’s documents, optionally filtered to a single detected type.

for doc in space.documents(type="policy"):
    print(doc.path, doc.topics, doc.summary)
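
Both `stats.types` and `documents(type=...)` key off each document's detected type. As a rough mental model, the sketch below reproduces the histogram and the filter on stand-in data. `FakeDoc` and the sample paths are hypothetical, and the computation is an assumption about the result's shape, not a description of how indx derives it.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for Document with only the fields used here."""
    path: str
    type: str


docs = [
    FakeDoc("hr/leave.md", "policy"),
    FakeDoc("hr/travel.md", "policy"),
    FakeDoc("eng/onboarding.md", "guide"),
]

# Histogram: document count per detected type.
types = dict(Counter(doc.type for doc in docs))
print(types)  # {'policy': 2, 'guide': 1}

# Filter: keep only documents of a single detected type.
policies = [doc for doc in docs if doc.type == "policy"]
print([doc.path for doc in policies])  # ['hr/leave.md', 'hr/travel.md']
```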

`search(query, k=5)`

def search(self, query: str, k: int = 5) -> list[SearchHit]

Embeds query with the space’s embedder and returns the top-k chunks by similarity. Each result is a SearchHit exposing .chunk, a .score (higher is better), resolved .neighbors (the adjacent chunks, for context windows), and a .source property that proxies chunk.source.

for hit in space.search("how long is data retained?", k=3):
    print(f"{hit.score:.3f}  {hit.source.path}")
    print(hit.chunk.text)
    print("context:", [c.id for c in hit.neighbors])
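
Conceptually, `search` is a nearest-neighbour lookup over chunk vectors. The sketch below shows one way such a ranking could work, assuming cosine similarity and treating neighbours as the chunks immediately before and after a hit. Both choices are assumptions made for illustration, since this page does not specify the metric or the neighbour window; `top_k`, `Hit`, and the toy vectors are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical stand-in for SearchHit."""
    chunk_id: int
    score: float
    neighbors: list


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=5):
    """Rank chunks by similarity (higher is better) and attach adjacent ids."""
    scored = sorted(
        ((cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)),
        reverse=True,
    )
    hits = []
    for score, idx in scored[:k]:
        neighbors = [i for i in (idx - 1, idx + 1) if 0 <= i < len(chunk_vecs)]
        hits.append(Hit(chunk_id=idx, score=score, neighbors=neighbors))
    return hits


# Toy 2-D "embeddings" for four consecutive chunks.
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
for hit in top_k([1.0, 0.0], vectors, k=2):
    print(hit.chunk_id, round(hit.score, 3), hit.neighbors)
# 0 1.0 [1]
# 1 0.8 [0, 2]
```

The first chunk has only one neighbour because it sits at the start of the sequence.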

`load(archive)` / `save(archive)`

@classmethod
def load(cls, archive: str) -> KnowledgeSpace

def save(self, archive: str) -> None

load opens a sealed .indx archive: it checks that the archive’s major version is compatible, validates checksums, and reconstructs the in-memory models (vectors are memory-mapped on demand). save seals the current space into a portable .indx archive. Loading an incompatible or corrupt archive raises a fatal load error.

space = KnowledgeSpace.load("./ai-ready/handbook.indx")
hits = space.search("gdpr compliance", k=5)
space.save("./backup/handbook.indx")
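
The two checks that `load` performs can be illustrated in isolation. The sketch below is hypothetical throughout: the manifest shape, the `check_archive` and `ArchiveLoadError` names, the use of SHA-256, and the rule that only an equal major version is compatible are all assumptions, because this page does not describe the archive's internal layout or its version policy.

```python
import hashlib

SUPPORTED_MAJOR = 1  # hypothetical: the major version this reader understands


class ArchiveLoadError(Exception):
    """Hypothetical error type for an incompatible or corrupt archive."""


def check_archive(manifest, payloads):
    """Validate a (hypothetical) manifest against in-memory payload bytes."""
    major = int(manifest["version"].split(".")[0])
    if major != SUPPORTED_MAJOR:
        raise ArchiveLoadError(f"incompatible major version: {major}")
    for name, expected in manifest["checksums"].items():
        actual = hashlib.sha256(payloads[name]).hexdigest()
        if actual != expected:
            raise ArchiveLoadError(f"checksum mismatch: {name}")


payloads = {"chunks.jsonl": b'{"id": 1}\n'}
manifest = {
    "version": "1.4.0",
    "checksums": {
        "chunks.jsonl": hashlib.sha256(payloads["chunks.jsonl"]).hexdigest(),
    },
}

check_archive(manifest, payloads)  # passes silently

payloads["chunks.jsonl"] += b"tampered"
try:
    check_archive(manifest, payloads)
except ArchiveLoadError as exc:
    print(exc)  # checksum mismatch: chunks.jsonl
```

Failing before any model is reconstructed mirrors the behaviour described above, where a bad archive stops the load outright instead of yielding a partial space.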

End-to-end examples

Basic run

from indx import DirectoryPipeline

pipeline = DirectoryPipeline(
    parser="docling",
    llm="openai:gpt-5-mini",
    embedder="openai:text-embedding-3-small",
    store="qdrant",
)
space = pipeline.run("./docs", "./ai-ready")

print(space.stats.documents, space.stats.chunks)
for doc in space.documents(type="policy"):
    print(doc.path, doc.topics)

for hit in space.search("how long is data retained?", k=3):
    print(hit.score, hit.source.path)
    print(hit.chunk.text)
    print("context:", [c.id for c in hit.neighbors])

Bring your own component

Any slot can be a custom object that satisfies its protocol — passed at construction or via use(). Structural typing means you do not subclass anything.

from indx import DirectoryPipeline, ParsedDoc, Source
from indx.parsers import Parser

class MyMarkdownParser:
    """Custom Parser: satisfies the Parser protocol structurally."""
    def parse(self, file) -> ParsedDoc:
        text = file.read_text()
        return ParsedDoc(
            source=Source(path=file.path, folder=file.folder, type="markdown"),
            text=text,
            blocks=[{"kind": "paragraph", "text": p} for p in text.split("\n\n")],
        )

pipeline = (
    DirectoryPipeline(embedder="bge-m3", store="chroma")
    .use(parser=MyMarkdownParser())   # BYO parser instance
    .drop("enrich")                   # skip LLM enrichment entirely
)
space = pipeline.run("./notes", "./out")

See Custom components and Bring your own stack for the full pattern.

Load and query an existing archive

from indx import KnowledgeSpace

space = KnowledgeSpace.load("./ai-ready/handbook.indx")

print("docs:", space.stats.documents, "dim:", space.stats.embed_dim)

for hit in space.search("incident response runbook", k=5):
    print(f"[{hit.score:.3f}] {hit.source.path} ({hit.source.type})")
    print("  ", hit.chunk.text[:120], "…")
    print("   neighbors:", [c.id for c in hit.neighbors])