Configuring indx (indx.toml)

indx.toml is the optional, declarative way to pin which components your pipeline uses and how they behave. It is never required: the documented defaults resolve a full run against cloud-backed models, and the local profile remains available when a run must stay offline. Reach for a config file when you want a build to be reproducible, shareable, and explicit about the stack it ran on.

Every key in indx.toml has a documented default, so a bare run resolves a complete stack on its own:

indx ./docs --out ./ai-ready

This uses parser docling, llm openai:gpt-5-mini, vlm none, embedder openai:text-embedding-3-small, store qdrant, and output .indx — all resolvable without a config file once the selected extras are installed. A config file simply lets you write those choices down, override individual slots, and pass backend-specific options that the CLI flags don’t cover.

For each component slot, indx resolves the effective value from four layers, highest priority first:

explicit code argument / use() > CLI flag > indx.toml > documented default
| Layer | Example | Wins over |
| --- | --- | --- |
| Explicit code arg / use() | DirectoryPipeline(store="chroma") or .use(store="chroma") | everything below |
| CLI flag | indx ./docs -o ./out --store chroma | indx.toml and the default |
| indx.toml | [store] backend = "chroma" | the documented default |
| Documented default | store = "qdrant" | |

A name that doesn’t resolve to a registered component (in any layer) is a fatal error raised before any stage runs, so misconfiguration fails fast rather than mid-build. See errors and exit codes — a bad config file or unknown component name exits with code 3.
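
The four-layer lookup and the fail-fast check can be sketched in a few lines. This is a minimal illustration, not indx's implementation: `resolve_slot`, `UnknownComponentError`, and `REGISTERED_STORES` are hypothetical names, and the sketch assumes that "in any layer" means every supplied value is validated, even one that a higher layer overrides.

```python
from typing import Optional

# Hypothetical registry, filled with the store names listed on this page.
REGISTERED_STORES = {"qdrant", "pgvector", "chroma", "lancedb", "jsonl"}


class UnknownComponentError(ValueError):
    """Hypothetical error type; indx's real exception name may differ."""


def resolve_slot(
    code_arg: Optional[str],
    cli_flag: Optional[str],
    toml_value: Optional[str],
    default: str,
    registered: set[str],
) -> str:
    """Pick the highest-priority value that is set, after validating every layer."""
    layers = (code_arg, cli_flag, toml_value, default)
    # Validate all supplied layers first, so a bad name fails before any work starts.
    for value in layers:
        if value is not None and value not in registered:
            raise UnknownComponentError(f"unknown component: {value!r}")
    # Layers are ordered highest priority first; the default is always set.
    return next(value for value in layers if value is not None)


print(resolve_slot(None, "chroma", "lancedb", "qdrant", REGISTERED_STORES))  # chroma
print(resolve_slot(None, None, "lancedb", "qdrant", REGISTERED_STORES))      # lancedb
print(resolve_slot(None, None, None, "qdrant", REGISTERED_STORES))           # qdrant
```

Passing `None` for a layer means "not set at that layer", which is why the documented default can never be `None`.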

Every section and key below is optional; omitted keys fall back to the documented default. Each section configures one part of the pipeline, and [enrich] carries two component slots (llm and vlm).

[parser]
engine = "docling"  # str. Parser name. Default: "docling".

[enrich]
llm = "openai:gpt-5-mini"  # str. LLM name[:model] or "none". Default: "openai:gpt-5-mini".
vlm = "none"  # str. VLM name or "none". Default: "none".
# list[str]. Which enrichments to produce.
# Default: ["type", "topics", "tags", "summary"].
metadata = ["type", "topics", "tags", "summary"]

[embed]
model = "openai:text-embedding-3-small"  # str. Embedder name. Default cloud embedder.

[store]
# str. One of: qdrant | pgvector | chroma | lancedb | jsonl.
# Default: "qdrant".
backend = "qdrant"

[output]
# str. One of: .indx | jsonl | langchain | llamaindex.
# Default: ".indx".
format = ".indx"

| Section | Key | Type | Default | Allowed values |
| --- | --- | --- | --- | --- |
| [parser] | engine | string | docling | any registered parser name |
| [enrich] | llm | string | openai:gpt-5-mini | <name>[:model], none |
| [enrich] | vlm | string | none | <name>, none |
| [enrich] | metadata | list[str] | ["type","topics","tags","summary"] | subset of those four |
| [embed] | model | string | openai:text-embedding-3-small | any registered embedder name |
| [store] | backend | string | qdrant | qdrant, pgvector, chroma, lancedb, jsonl |
| [output] | format | string | .indx | .indx, jsonl, langchain, llamaindex |

For the exhaustive table with every key, type, and constraint, see the configuration reference.

Secrets come from the environment, never the file

indx.toml is meant to be committed and shared, so it must not contain credentials. API keys and other secrets are supplied through environment variables and layered in by pydantic-settings, never written into the file.

export INDX_LLM__API_KEY="sk-..." # double underscore = nested setting
indx ./docs --out ./ai-ready --llm openai:gpt-5-mini

[enrich]
llm = "openai:gpt-5-mini" # the model choice lives here…
# …the api_key comes from $INDX_LLM__API_KEY, not this file
# pick a different model with the :model suffix, e.g. "openai:gpt-4o"

The env prefix is keyed on the slot name, not the TOML section heading — so the LLM’s key is INDX_LLM__… even though the LLM is configured under [enrich]. The double underscore separates the slot from the nested setting (INDX_<SLOT>__<SETTING>):

| Slot | TOML location | Env prefix |
| --- | --- | --- |
| parser | [parser] engine | INDX_PARSER__… |
| llm | [enrich] llm | INDX_LLM__… |
| vlm | [enrich] vlm | INDX_VLM__… |
| embedder | [embed] model | INDX_EMBEDDER__… |
| store | [store] backend | INDX_STORE__… |
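
The naming scheme in the table can be illustrated with a stdlib-only sketch that collects the variables for one slot. In indx this layering is done by pydantic-settings, so the code below is only a model of the INDX_<SLOT>__<SETTING> convention; `slot_settings` is a hypothetical helper, and lowercasing the setting name is an assumption.

```python
import os
from collections.abc import Mapping


def slot_settings(slot: str, environ: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Collect INDX_<SLOT>__<SETTING> variables for one slot, keyed by setting name."""
    prefix = f"INDX_{slot.upper()}__"
    return {
        name[len(prefix):].lower(): value  # assumption: settings are lowercased
        for name, value in environ.items()
        if name.startswith(prefix)
    }


# A fake environment stands in for os.environ so the example has no side effects.
fake_env = {"INDX_LLM__API_KEY": "sk-example", "HOME": "/home/me"}
print(slot_settings("llm", fake_env))    # {'api_key': 'sk-example'}
print(slot_settings("store", fake_env))  # {}
```

Note that the lookup uses the slot name (`llm`), not the TOML section (`enrich`), which mirrors the rule stated above.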

Most adapters accept options beyond the simple slot name. Those live in a sub-table named after the backend (e.g. [store.qdrant]). indx passes the keys in such a sub-table verbatim to the adapter constructor — they are opaque to the core, so each backend documents its own keys.

[store]
backend = "qdrant"
[store.qdrant]
url = "http://localhost:6333" # passed straight through to the Qdrant adapter
# collection = "handbook" # any further keys are adapter-defined

The same pattern applies to other slots and to third-party plugins — once a plugin is installed, its name works in backend/engine/etc. and its sub-table carries its options. See Bring your own stack and authoring a plugin.

If you don’t pass --config, indx auto-loads ./indx.toml from the current directory when it exists. To use a different file, point the CLI or SDK at it explicitly:

indx ./docs --out ./ai-ready --config ./configs/prod.indx.toml

from indx import DirectoryPipeline
space = DirectoryPipeline(config="./configs/prod.indx.toml").run("./docs", "./ai-ready")

In the SDK, config accepts either a path string or an IndxConfig object, and component arguments passed to the constructor or use() still override anything the file says (per the precedence rules above).
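
The discovery rule (explicit path first, otherwise ./indx.toml if it exists) can be sketched as follows. `find_config` is a hypothetical helper written for illustration and is not part of the indx API.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_config(explicit: Optional[str], cwd: Path) -> Optional[Path]:
    """Return the config path to load, or None when the defaults should be used."""
    if explicit is not None:
        return Path(explicit)          # an explicit path always wins
    candidate = cwd / "indx.toml"
    return candidate if candidate.is_file() else None


with tempfile.TemporaryDirectory() as tmp:
    cwd = Path(tmp)
    print(find_config(None, cwd))       # None: nothing to auto-load yet
    (cwd / "indx.toml").write_text('[store]\nbackend = "chroma"\n')
    print(find_config(None, cwd).name)  # indx.toml
    print(find_config("./configs/prod.indx.toml", cwd).name)  # prod.indx.toml
```

Returning `None` rather than raising keeps the "config is never required" property: a missing file is not an error in this sketch.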

The resolved config is recorded for reproducibility

After all four layers are merged, indx writes a snapshot of the resolved configuration into the output — both index.json.metadata.config and the archive’s manifest.json. That snapshot pins the exact parser, llm/model, embedder (with dim), store, and output format that produced the space, so a .indx archive is self-describing and a build is auditable long after the fact.

{
  "metadata": {
    "tool_version": "indx 0.4.2",
    "embedder": { "name": "bge-m3", "dim": 1024 },
    "config": { "...": "snapshot of resolved indx.toml" }
  }
}

Combined with deterministic ids and temperature=0.0 enrichment, this is what makes re-running over unchanged input reproducible. See Reproducibility for the full guarantees.
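
One way to picture a snapshot that stays identical across re-runs is to serialize the resolved config with a fixed key order. This is an illustrative sketch only: `write_snapshot` is a hypothetical helper, and whether indx sorts keys when it writes index.json or manifest.json is an assumption, not something this page states.

```python
import json


def write_snapshot(resolved: dict, tool_version: str) -> str:
    """Serialize the resolved config under metadata.config with a stable key order."""
    document = {"metadata": {"tool_version": tool_version, "config": resolved}}
    # sort_keys makes the text independent of dict insertion order (an assumption here).
    return json.dumps(document, sort_keys=True, indent=2)


a = {"store": {"backend": "qdrant"}, "parser": {"engine": "docling"}}
b = {"parser": {"engine": "docling"}, "store": {"backend": "qdrant"}}
print(write_snapshot(a, "indx 0.4.2") == write_snapshot(b, "indx 0.4.2"))  # True
```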

indx parses indx.toml with the stdlib tomllib (available since Python 3.11, indx’s floor), so no extra parsing dependency is needed. tomllib is read-only: it can parse TOML but cannot write it. As a result, indx never round-trips (re-serializes) your file. It only reads the config you author, and config scaffolding is generated fresh from a small template, so your comments and formatting are never clobbered.