Configuring indx (indx.toml)

indx.toml is the optional, declarative way to pin which components your pipeline uses and how they behave. It is never required: with the documented defaults, a full run works out of the box using cloud-backed models, and the local profile remains available when the run must stay offline. Reach for a config file when you want a build to be reproducible, shareable, and explicit about the stack it ran on.

Configuration is optional

Every key in indx.toml has a documented default, so a bare run resolves a complete stack on its own:

indx ./docs --out ./ai-ready

This bare run uses parser docling, llm openai:gpt-5-mini, vlm none, embedder openai:text-embedding-3-small, store qdrant, and output format .indx. All of these resolve without a config file once the selected extras are installed. A config file simply lets you write those choices down, override individual slots, and pass backend-specific options that the CLI flags don’t cover.

Precedence

For each component slot, indx resolves the effective value from four layers, highest priority first:

explicit code argument / use()  >  CLI flag  >  indx.toml  >  documented default

Layer	Example	Wins over
Explicit code arg / `use()`	`DirectoryPipeline(store="chroma")` or `.use(store="chroma")`	everything below
CLI flag	`indx ./docs -o ./out --store chroma`	`indx.toml` and the default
`indx.toml`	`[store]` `backend = "chroma"`	the documented default
Documented default	`store = "qdrant"`	—

A name that doesn’t resolve to a registered component, in any layer, is a fatal error raised before any stage runs, so misconfiguration fails fast rather than mid-build. A bad config file or an unknown component name exits with code 3; see errors and exit codes.
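The four-layer lookup described above can be sketched in a few lines of plain Python. This is an illustration of the precedence rule only, not indx’s actual resolver: `resolve_slots` and the flat slot dictionaries are hypothetical, and the defaults are copied from this page.

```python
from collections import ChainMap

# Documented defaults, copied from this page.
DEFAULTS = {
    "parser": "docling",
    "llm": "openai:gpt-5-mini",
    "vlm": "none",
    "embedder": "openai:text-embedding-3-small",
    "store": "qdrant",
    "output": ".indx",
}


def resolve_slots(code_args=None, cli_flags=None, file_config=None):
    """Hypothetical helper: merge the four layers, highest priority first."""
    layers = [code_args or {}, cli_flags or {}, file_config or {}, DEFAULTS]
    # Drop unset (None) entries so an unset slot falls through to the next layer.
    cleaned = [{k: v for k, v in layer.items() if v is not None} for layer in layers]
    # ChainMap returns the value from the first mapping that has the key.
    return dict(ChainMap(*cleaned))


resolved = resolve_slots(
    code_args={"store": "chroma"},
    cli_flags={"store": "lancedb", "llm": "none"},
    file_config={"store": "pgvector", "vlm": "none"},
)
print(resolved["store"])   # chroma: the code argument beats the CLI flag and the file
print(resolved["llm"])     # none: the CLI flag beats the default
print(resolved["parser"])  # docling: no layer overrides it, so the default applies
```

Each slot is resolved independently, so overriding the store in code does not stop the LLM choice from coming from the CLI or the file.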

A complete annotated indx.toml

Every section and key below is optional; omitted keys fall back to the documented default. The sections map one-to-one onto the pipeline’s component slots.

[parser]
engine = "docling"            # str. Parser name. Default: "docling".

[enrich]
llm      = "openai:gpt-5-mini" # str. LLM name[:model] or "none". Default: "openai:gpt-5-mini".
vlm      = "none"             # str. VLM name or "none". Default: "none".
metadata = ["type", "topics", "tags", "summary"]
                              # list[str]. Which enrichments to produce.
                              # Default: ["type", "topics", "tags", "summary"].

[embed]
model = "openai:text-embedding-3-small" # str. Embedder name. Default: "openai:text-embedding-3-small".

[store]
backend = "qdrant"            # str. One of: qdrant | pgvector | chroma | lancedb | jsonl.
                              # Default: "qdrant".

[output]
format = ".indx"              # str. One of: .indx | jsonl | langchain | llamaindex.
                              # Default: ".indx".

Key reference

Section	Key	Type	Default	Allowed values
`[parser]`	`engine`	string	`docling`	any registered parser name
`[enrich]`	`llm`	string	`openai:gpt-5-mini`	`<name>[:model]`, `none`
`[enrich]`	`vlm`	string	`none`	`<name>`, `none`
`[enrich]`	`metadata`	`list[str]`	`["type","topics","tags","summary"]`	subset of those four
`[embed]`	`model`	string	`openai:text-embedding-3-small`	any registered embedder name
`[store]`	`backend`	string	`qdrant`	`qdrant`, `pgvector`, `chroma`, `lancedb`, `jsonl`
`[output]`	`format`	string	`.indx`	`.indx`, `jsonl`, `langchain`, `llamaindex`

For the exhaustive table with every key, type, and constraint, see the configuration reference.

Secrets come from the environment, never the file

indx.toml is meant to be committed and shared, so it must not contain credentials. API keys and other secrets are supplied through environment variables and layered in by pydantic-settings, never written into the file.

export INDX_LLM__API_KEY="sk-..."     # double underscore = nested setting
indx ./docs --out ./ai-ready --llm openai:gpt-5-mini

[enrich]
llm = "openai:gpt-5-mini"   # the model choice lives here…
# …the api_key comes from $INDX_LLM__API_KEY, not this file
# pick a different model with the :model suffix, e.g. "openai:gpt-4o"

The env prefix is keyed on the slot name, not the TOML section heading — so the LLM’s key is INDX_LLM__… even though the LLM is configured under [enrich]. The double underscore separates the slot from the nested setting (INDX_<SLOT>__<SETTING>):

Slot	TOML location	Env prefix
parser	`[parser]` `engine`	`INDX_PARSER__…`
llm	`[enrich]` `llm`	`INDX_LLM__…`
vlm	`[enrich]` `vlm`	`INDX_VLM__…`
embedder	`[embed]` `model`	`INDX_EMBEDDER__…`
store	`[store]` `backend`	`INDX_STORE__…`
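To make the INDX_<SLOT>__<SETTING> convention concrete, here is a minimal stdlib-only sketch that groups such variables by slot. It is not how indx reads its environment (the page attributes that to pydantic-settings, which supports this style through its env_nested_delimiter option); `collect_env_settings` is hypothetical, and of the variables below only INDX_LLM__API_KEY appears on this page. INDX_STORE__URL is made up for illustration.

```python
def collect_env_settings(environ, prefix="INDX_"):
    """Hypothetical helper: group INDX_<SLOT>__<SETTING> variables by slot."""
    settings = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        # Split on the first double underscore: slot on the left, setting on the right.
        slot, sep, setting = name[len(prefix):].partition("__")
        if not sep or not slot or not setting:
            continue  # not in INDX_<SLOT>__<SETTING> form
        settings.setdefault(slot.lower(), {})[setting.lower()] = value
    return settings


# A fake environment keeps the example independent of the real os.environ.
fake_env = {
    "INDX_LLM__API_KEY": "placeholder-llm-key",
    "INDX_STORE__URL": "http://localhost:6333",  # illustrative variable name
    "PATH": "/usr/bin",
}
grouped = collect_env_settings(fake_env)
print(grouped["llm"])  # {'api_key': 'placeholder-llm-key'}
```

Passing `os.environ` instead of `fake_env` would apply the same grouping to the real process environment.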

Backend-specific sub-tables

Most adapters accept options beyond the simple slot name. Those live in a sub-table named after the backend (e.g. [store.qdrant]). indx passes the keys in such a sub-table verbatim to the adapter constructor — they are opaque to the core, so each backend documents its own keys.

[store]
backend = "qdrant"

[store.qdrant]
url = "http://localhost:6333"   # passed straight through to the Qdrant adapter
# collection = "handbook"       # any further keys are adapter-defined

The same pattern applies to other slots and to third-party plugins: once a plugin is installed, its name is accepted wherever a component name goes (backend, engine, and so on), and its sub-table carries its options. See Bring your own stack and authoring a plugin.

Config discovery

If you don’t pass --config, indx auto-loads ./indx.toml from the current directory when it exists. To use a different file, point the CLI or SDK at it explicitly:

indx ./docs --out ./ai-ready --config ./configs/prod.indx.toml

from indx import DirectoryPipeline

space = DirectoryPipeline(config="./configs/prod.indx.toml").run("./docs", "./ai-ready")

In the SDK, config accepts either a path string or an IndxConfig object, and component arguments passed to the constructor or use() still override anything the file says (per the precedence rules above).
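The discovery rule reads naturally as a two-step lookup. The sketch below is one way to express it, assuming only what this section states: an explicit path wins, otherwise ./indx.toml is used if it exists. `find_config` is a hypothetical helper, and what indx does when an explicit path is missing is not modelled here.

```python
from pathlib import Path
import tempfile


def find_config(explicit=None, cwd=None):
    """Hypothetical helper: return the config path to load, or None."""
    if explicit is not None:
        return Path(explicit)  # an explicit --config / config= path always wins
    candidate = Path(cwd or Path.cwd()) / "indx.toml"
    return candidate if candidate.is_file() else None


with tempfile.TemporaryDirectory() as tmp:
    before = find_config(cwd=tmp)  # no indx.toml yet, so nothing to load
    (Path(tmp) / "indx.toml").write_text('[store]\nbackend = "chroma"\n')
    after = find_config(cwd=tmp)   # auto-discovered from the directory
    chosen = find_config("./configs/prod.indx.toml", cwd=tmp)

print(before)      # None
print(after.name)  # indx.toml
print(chosen)      # configs/prod.indx.toml
```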

The resolved config is recorded for reproducibility

After all four layers are merged, indx writes a snapshot of the resolved configuration into the output, in both the metadata.config field of index.json and the archive’s manifest.json. That snapshot pins the exact parser, llm/model, embedder (with dim), store, and output format that produced the space, so a .indx archive is self-describing and a build is auditable long after the fact.

{
  "metadata": {
    "tool_version": "indx 0.4.2",
    "embedder": { "name": "bge-m3", "dim": 1024 },
    "config": { "...": "snapshot of the resolved configuration" }
  }
}

Combined with deterministic ids and temperature=0.0 enrichment, this is what makes re-running over unchanged input reproducible. See Reproducibility for the full guarantees.
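Reading the snapshot back is plain JSON handling. The sketch below assumes only the fields shown in the fragment above (`tool_version` and `embedder` under `metadata`); the contents of `config` are elided on this page, so the example does not look inside it. `describe_build` is a hypothetical helper.

```python
import json

# Same shape as the fragment above; the "config" body is elided in the source.
INDEX_JSON = """
{
  "metadata": {
    "tool_version": "indx 0.4.2",
    "embedder": { "name": "bge-m3", "dim": 1024 },
    "config": { "...": "snapshot of the resolved configuration" }
  }
}
"""


def describe_build(index_text):
    """Hypothetical helper: summarize which stack produced a space."""
    metadata = json.loads(index_text)["metadata"]
    embedder = metadata["embedder"]
    return f"{metadata['tool_version']}, embedder {embedder['name']} (dim {embedder['dim']})"


summary = describe_build(INDEX_JSON)
print(summary)  # indx 0.4.2, embedder bge-m3 (dim 1024)
```

A check like this is useful before querying a space, since the query-time embedder has to match the name and dimension recorded at build time.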

Why TOML, and why init never rewrites it

indx parses indx.toml with the standard-library tomllib module (available since Python 3.11, the minimum version indx supports), so no extra parsing dependency is needed. tomllib is read-only: it can parse TOML but cannot write it. As a result, config scaffolding is generated from a small template, and indx never round-trips (re-serializes) your file. It only reads the config you author and, when scaffolding, writes a fresh template, so your comments and formatting are never clobbered.