Performance & Scaling

indx is built to turn whole document estates — not just a handful of files — into a knowledge space. This guide shows how to keep large runs fast and memory-stable by tuning the right lever at the right stage: parallelism for parsing, batching for embedding, bounded concurrency for enrichment, and content-addressed caching for cheap re-runs.

The headline targets indx is engineered against:

Time-to-first-space: under 60 seconds on a small directory (~10 docs) on a typical laptop with defaults.
Scale: directories of 10k+ files processed crash-free in over 99% of runs, streaming rather than loading the whole estate into memory.
Memory: a 2 GB folder must not require 2 GB of RAM — files are processed as a stream, not materialised all at once.

The mental model: one lever per stage

Each of the six stages (Walk → Parse → Chunk → Relate → Enrich → Embed+Pack) has a different workload profile, so there is no single concurrency knob. indx applies a stage-appropriate strategy:

Stage	Workload	Strategy	Default
01 Walk	I/O, CPU	Single-pass; may parallelise per-folder traversal	—
02 Parse	Blocking native / CPU-bound	Worker pool across files (thread pool; process pool for GIL-bound parsers)	`--jobs` (CPU count)
03 Chunk	CPU-bound	Single-pass	—
04 Relate	CPU-bound	Single-pass	—
05 Enrich	Network/model-bound (LLM/VLM)	Bounded concurrency, per-provider rate-limit aware	concurrency 4
06 Embed+Pack	Model + store I/O	Batched embed + batched `Store.upsert`	batch 64

Two principles run through this table:

Parse is embarrassingly parallel across files. A worker pool of --jobs runs Parser.parse concurrently and merges results into the context keyed by doc_id. Parsers that hold native or GIL-bound resources (such as Docling) may run in a process pool instead of a thread pool.
Batching beats parallelism for embeddings. Local embedders like bge-m3 are far more efficient on batches (CPU/GPU vectorisation), and vector-store upserts are batched to amortise round-trips. This is the single biggest performance lever, and it applies whether the embedder is local or a cloud API.
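The first principle can be sketched with the standard library alone. This is a minimal illustration, not indx's implementation: `parse_file` is a hypothetical stand-in for `Parser.parse`, and the way `doc_id` is derived here is an assumption made only so the example runs.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib
import os


def parse_file(path):
    """Hypothetical stand-in for Parser.parse: returns (doc_id, parsed)."""
    # Assumption: doc_id is derived from the path; indx may derive it differently.
    doc_id = hashlib.sha256(path.encode("utf-8")).hexdigest()[:16]
    return doc_id, {"path": path, "text": f"parsed:{path}"}


def parse_all(paths, jobs=None):
    """Parse files concurrently and merge the results keyed by doc_id."""
    jobs = jobs or os.cpu_count() or 1
    docs = {}
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # At most `jobs` parses run at once; results come back in input order.
        for doc_id, parsed in pool.map(parse_file, paths):
            docs[doc_id] = parsed
    return docs


docs = parse_all(["a.pdf", "b.docx", "c.md"], jobs=2)
```

For a GIL-bound parser, swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` keeps the same shape, provided the parse function and its results can be pickled.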

For network-bound cloud calls (a hosted LLM, a remote embedder, a server-mode store), indx uses asyncio with bounded concurrency and per-provider rate limiting, so latency is hidden without spawning threads. The default profile is deliberately conservative and bounded so a build on a laptop never exhausts memory or hammers a rate-limited API.
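Bounded concurrency of this kind is commonly built on `asyncio.Semaphore`. The sketch below shows the general pattern only; `enrich_all` and `fake_model` are hypothetical names, and per-provider rate limiting is left out for brevity.

```python
import asyncio


async def enrich_all(chunks, call_model, max_concurrency=4):
    """Run call_model over chunks with at most max_concurrency calls in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(chunk):
        async with sem:  # waits while max_concurrency calls are already running
            return await call_model(chunk)

    # gather() returns results in the same order as the input chunks.
    return await asyncio.gather(*(guarded(c) for c in chunks))


async def fake_model(chunk):
    """Hypothetical stand-in for a hosted LLM call."""
    await asyncio.sleep(0.01)
    return chunk.upper()


results = asyncio.run(enrich_all(["a", "b", "c"], fake_model, max_concurrency=2))
```

All calls share one event-loop thread, so raising `max_concurrency` costs no extra threads; the practical ceiling is the provider's rate limit.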

Batching & concurrency parameters

These are the defaults; each is overridable via adapter sub-tables in indx.toml or component kwargs in the SDK.

Stage	Parameter	Default
Embed	batch size	`64`
Embed	max concurrency	`--jobs`
Enrich	max concurrency	`4`
Parse	workers	`--jobs`

--jobs (alias -j) defaults to the CPU count and controls both parse workers and embed concurrency:

# Use 8 parse workers / embed concurrency
indx ./docs --out ./ai-ready --jobs 8

Caching & resume

indx keeps a content-addressed cache under <out>/.indx-cache/. Each entry is keyed on:

(stage, sha256(input), component-id, relevant-config)
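As an illustration of how such a tuple can become a stable key, here is a minimal sketch. The function name, the separator, and the use of canonical JSON for the config are assumptions; the actual on-disk key format used by indx is not specified here.

```python
import hashlib
import json


def cache_key(stage, input_bytes, component_id, relevant_config):
    """Derive a deterministic key from (stage, sha256(input), component-id, relevant-config)."""
    input_digest = hashlib.sha256(input_bytes).hexdigest()
    # Canonical JSON (sorted keys, fixed separators) so dict ordering cannot change the key.
    config_blob = json.dumps(relevant_config, sort_keys=True, separators=(",", ":"))
    # Join with a separator that will not occur in the parts (assumed here: unit separator).
    material = "\x1f".join([stage, input_digest, component_id, config_blob])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


key = cache_key("embed", b"chunk text", "bge-m3", {"batch_size": 64})
```

Only configuration that affects a stage's output belongs in `relevant_config`; including unrelated settings would invalidate entries more often than necessary.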

Pass --resume to reuse any cache entry whose key is unchanged, skipping recomputation for unmodified files and unchanged configuration:

indx ./docs --out ./ai-ready --resume

Because the key includes the component identity and the relevant config, invalidation is precise and scoped to what actually changed:

You change…	Invalidates
The parser (`--parser`)	Parse and everything downstream
The embedder (`--embedder`)	Only Embed
A single source file	That file’s entries (and their downstream)
Nothing	Nothing — the whole run is cache hits

This is why re-running over a large estate after a small edit is cheap: only the touched files and the stages affected by your change are recomputed. The cache also makes stages idempotent: re-running a stage on unchanged input and configuration is a cache hit, so it never duplicates work or corrupts state.
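One way to get this downstream propagation is to fold each stage's upstream key into its own key. The sketch below assumes that scheme purely for illustration; the stage names, the `"default"` component ids, and the helper names are hypothetical, and indx may chain on stage outputs rather than keys.

```python
import hashlib

# Stages that run per file after Walk (names are illustrative).
STAGES = ["parse", "chunk", "relate", "enrich", "embed"]


def stage_keys(file_bytes, components):
    """Return {stage: key}; each key includes the upstream key, so changes flow downstream."""
    keys = {}
    upstream = hashlib.sha256(file_bytes).hexdigest()
    for stage in STAGES:
        material = f"{stage}|{upstream}|{components[stage]}"
        upstream = hashlib.sha256(material.encode("utf-8")).hexdigest()
        keys[stage] = upstream
    return keys


def changed_stages(before, after):
    """List the stages whose cache key differs between two runs."""
    return [s for s in STAGES if before[s] != after[s]]


base = {"parse": "docling", "chunk": "default", "relate": "default",
        "enrich": "ollama:qwen2.5", "embed": "bge-m3"}
first = stage_keys(b"file contents", base)
swapped_embedder = stage_keys(b"file contents", {**base, "embed": "another-embedder"})
swapped_parser = stage_keys(b"file contents", {**base, "parse": "another-parser"})
```

Here `changed_stages(first, swapped_embedder)` contains only `"embed"`, while `changed_stages(first, swapped_parser)` contains every stage, matching the rows of the table above.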

In --verbose mode, cache hits and misses are reported per stage, so you can confirm a resume is doing what you expect:

indx ./docs --out ./ai-ready --resume --verbose

Streaming & memory rules

The performance contract is that indx streams the estate rather than materialising it. Walk and the downstream stages process files as an iterator, holding the working set — not the whole directory — in memory. Files are read incrementally where the parser allows, and large intermediate buffers are released promptly.
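A lazy walk of this kind can be written as a generator. This is a minimal sketch using only the standard library; `walk_lazily` is a hypothetical name, and it makes no claim about how indx's Walk stage is implemented.

```python
import os
from typing import Iterator


def walk_lazily(root: str) -> Iterator[str]:
    """Yield file paths under root one at a time, without building a full file list."""
    pending = [root]  # directories still to visit; files are never accumulated
    while pending:
        current = pending.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    pending.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path
```

Because the function is a generator, a consumer that handles one path at a time keeps only the pending directory stack in memory, not the whole listing.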

Concretely, this is what keeps a 2 GB folder from needing 2 GB of RAM:

Walk yields files lazily; nothing assumes the full file list fits in memory.
Parse runs in bounded worker pools, so only --jobs documents are in flight at once.
Embed and upsert flow in batches of 64, so vectors are written and dropped rather than accumulated.
Vectors in a sealed .indx archive are memory-mapped on demand from vectors.f32 on load, not read whole.
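To illustrate the last point, the sketch below reads a single vector through a memory map instead of loading the file. It assumes a raw, row-major, little-endian float32 layout with a known dimension; the real layout of `vectors.f32` is not documented here, and `read_vector` is a hypothetical helper.

```python
import mmap
import struct


def read_vector(path, index, dim):
    """Read one float32 vector from a raw file by mapping it, not reading it whole."""
    row_bytes = dim * 4  # each float32 is 4 bytes
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = index * row_bytes
            if index < 0 or start + row_bytes > len(mm):
                raise IndexError("vector index out of range")
            # Slicing copies only this row's bytes out of the mapping.
            row = mm[start:start + row_bytes]
            return list(struct.unpack(f"<{dim}f", row))
```

The operating system pages in only the parts of the file that are touched, so resident memory follows the access pattern rather than the file size.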

Stream + batch pattern (SDK)

When building a custom stage or your own ingestion loop, follow the stream-then-batch shape rather than loading everything and processing one item at a time:

from itertools import islice

def batched(iterable, size):
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

# Stream chunks, embed and upsert in batches, and skip batches already in the cache.
# ctx, cache, embedder and store are assumed to come from your own pipeline setup.
for batch in batched(ctx.chunks, size=64):
    if cache.has(batch):           # resume: skip completed work
        continue
    vectors = embedder.embed([c.text for c in batch])
    store.upsert(
        ids=[c.id for c in batch],
        vectors=vectors,
        payloads=[c.metadata for c in batch],
    )

The anti-pattern to avoid is the opposite — reading every file up front and embedding one chunk at a time:

# Anti-pattern: materialises the whole estate, one round-trip per chunk
texts = [p.read_text() for p in all_files]
vectors = [embedder.embed([t]) for t in texts]

Making the local profile fast enough

The shipped zero-config defaults are cloud-backed, so model-heavy work runs on managed APIs out of the box. The opt-in local profile (docling + ollama:qwen2.5 + bge-m3) still supports the air-gapped path with zero network calls, but local models can be the slow part of a large run on commodity hardware. When you are running the local profile, the mitigations below apply, in order of impact:

Parallelise parsing. Raise --jobs to match your cores; parsing is usually the first bottleneck on document-heavy corpora.
Lean on batching. Keep embedding batched (default 64); increase the batch size if you have GPU/CPU headroom.

Skip what you do not need. Use --no-embed to produce a graph-only space (Walk → Relate, no vectors) when you only need structure. When you do not want LLM work, drop the Enrich stage:

# Graph only — no embedding, fastest path to structure
indx ./docs --out ./ai-ready --no-embed

from indx import DirectoryPipeline

# Drop enrichment to skip all LLM calls
space = DirectoryPipeline().drop("enrich").run("./docs", "./out")

Lean back on hosted models. When policy allows, move the heaviest stages to the same cloud backends used by the zero-config defaults. A hosted LLM for Enrich or a hosted embedder benefits from asyncio bounded concurrency and can dramatically cut wall-clock time on large estates. See Enrichment with LLM/VLM and Choosing an embedder.
Resume aggressively. Combine --resume with the cache so iterative runs only pay for what changed.

Quick reference

Goal	Lever
Parse faster	`--jobs` / `-j` (parse workers)
Embed faster	embed batch size (default 64), `--jobs` (embed concurrency)
Hide cloud latency	Enrich/embed bounded concurrency (`asyncio`)
Cheap re-runs	`--resume` (+ `--verbose` to see hits/misses)
Skip vectors	`--no-embed` (graph-only space)
Skip LLM work	drop the `enrich` stage
Stay memory-stable	stream + batch; let the defaults do their job

For the full flag list see the CLI reference (--jobs, --resume, --no-embed); for keeping runs auditable and byte-stable see Reproducibility.