
Performance & Scaling

indx is built to turn whole document estates — not just a handful of files — into a knowledge space. This guide shows how to keep large runs fast and memory-stable by tuning the right lever at the right stage: parallelism for parsing, batching for embedding, bounded concurrency for enrichment, and content-addressed caching for cheap re-runs.

The headline targets indx is engineered against:

  • Time-to-first-space: under 60 seconds on a small directory (~10 docs) on a typical laptop with defaults.
  • Scale: directories of 10k+ files processed crash-free in over 99% of runs, streaming rather than loading the whole estate into memory.
  • Memory: a 2 GB folder must not require 2 GB of RAM — files are processed as a stream, not materialised all at once.

Each of the six stages (Walk → Parse → Chunk → Relate → Enrich → Embed+Pack) has a different workload profile, so there is no single concurrency knob. indx applies a stage-appropriate strategy:

| Stage | Workload | Strategy | Default |
| --- | --- | --- | --- |
| 01 Walk | I/O, CPU | Single-pass; may parallelise per-folder traversal | |
| 02 Parse | Blocking native / CPU-bound | Worker pool across files (thread pool; process pool for GIL-bound parsers) | --jobs (CPU count) |
| 03 Chunk | CPU-bound | Single-pass | |
| 04 Relate | CPU-bound | Single-pass | |
| 05 Enrich | Network/model-bound (LLM/VLM) | Bounded concurrency, per-provider rate-limit aware | concurrency 4 |
| 06 Embed+Pack | Model + store I/O | Batched embed + batched Store.upsert | batch 64 |

Two principles run through this table:

  • Parse is embarrassingly parallel across files. A worker pool of --jobs runs Parser.parse concurrently and merges results into the context keyed by doc_id. Parsers that hold native or GIL-bound resources (such as Docling) may run in a process pool instead of a thread pool.
  • Batching beats parallelism for embeddings. Local embedders like bge-m3 are far more efficient on batches (CPU/GPU vectorisation), and vector-store upserts are batched to amortise round-trips. This is the single biggest performance lever, and it applies whether the embedder is local or a cloud API.
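
The first principle can be sketched with the standard library. This is a minimal illustration under stated assumptions, not indx's implementation: `parse_one`, `parse_all`, and the hash-based `doc_id` scheme are hypothetical stand-ins for `Parser.parse` and the pipeline context.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import hashlib
import os


def parse_one(path):
    """Hypothetical stand-in for Parser.parse: returns (doc_id, text)."""
    data = Path(path).read_bytes()
    doc_id = hashlib.sha256(data).hexdigest()[:16]  # assumed id scheme
    return doc_id, data.decode("utf-8", errors="replace")


def parse_all(paths, jobs=None):
    """Parse files concurrently and merge the results keyed by doc_id."""
    jobs = jobs or os.cpu_count() or 1
    merged = {}
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # map() runs parse_one on up to `jobs` threads and
        # returns results in the same order as the inputs.
        for doc_id, text in pool.map(parse_one, paths):
            merged[doc_id] = text
    return merged
```

For a parser that holds the GIL, the same shape should work with `ProcessPoolExecutor` in place of `ThreadPoolExecutor`, provided the parse function is defined at module level so it can be pickled.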

For network-bound cloud calls (a hosted LLM, a remote embedder, a server-mode store), indx uses asyncio with bounded concurrency and per-provider rate limiting, so latency is hidden without spawning threads. The default profile is deliberately conservative and bounded so a build on a laptop never exhausts memory or hammers a rate-limited API.
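
Bounded concurrency with asyncio is commonly built on a semaphore. The sketch below assumes a hypothetical `call` coroutine standing in for a hosted LLM or embedder request, and `enrich_all` is an illustrative name, not part of indx. It does not model per-provider rate limiting.

```python
import asyncio


async def enrich_all(items, call, max_concurrency=4):
    """Run call(item) for every item with at most max_concurrency in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        async with sem:  # waits here once the limit is reached
            return await call(item)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))
```

All requests share one thread; the semaphore only caps how many are awaiting a response at the same moment, which matches the default Enrich concurrency of 4.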

These are the defaults; each is overridable via adapter sub-tables in indx.toml or component kwargs in the SDK.

| Stage | Parameter | Default |
| --- | --- | --- |
| Embed | batch size | 64 |
| Embed | max concurrency | --jobs |
| Enrich | max concurrency | 4 |
| Parse | workers | --jobs |

--jobs (alias -j) defaults to the CPU count and controls both parse workers and embed concurrency:

# Use 8 parse workers / embed concurrency
indx ./docs --out ./ai-ready --jobs 8

indx keeps a content-addressed cache under <out>/.indx-cache/. Each entry is keyed on:

(stage, sha256(input), component-id, relevant-config)

Pass --resume to reuse any cache entry whose key is unchanged, skipping recomputation for unmodified files and unchanged configuration:

indx ./docs --out ./ai-ready --resume

Because the key includes the component identity and the relevant config, invalidation is precise and scoped to what actually changed:

| You change… | Invalidates |
| --- | --- |
| The parser (--parser) | Parse and everything downstream |
| The embedder (--embedder) | Only Embed |
| A single source file | That file’s entries (and their downstream) |
| Nothing | Nothing (the whole run is cache hits) |

This is why re-running over a large estate after a small edit is cheap: only the touched files and the stages affected by your change are recomputed. The cache also makes stages idempotent — re-running a stage on its own output never duplicates work or corrupts state.
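
To make the scoping concrete, here is one way such a key could be derived. It is a sketch only: the actual on-disk key format used by indx is not specified here, and `cache_key`, the separator, and the JSON encoding of the config are assumptions.

```python
import hashlib
import json


def cache_key(stage, input_bytes, component_id, config):
    """Hex digest over (stage, sha256(input), component-id, relevant-config)."""
    parts = [
        stage,
        hashlib.sha256(input_bytes).hexdigest(),
        component_id,
        json.dumps(config, sort_keys=True),  # stable regardless of key order
    ]
    # Join with a separator that cannot appear in the parts above.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


chunk = b"chunk text"
before = cache_key("embed", chunk, "bge-m3", {"batch": 64})
after = cache_key("embed", chunk, "other-embedder", {"batch": 64})
print(before != after)  # True: swapping the embedder changes only Embed keys
```

A Parse key built the same way does not include the embedder's identity, so swapping the embedder leaves it untouched. Changing the parser, by contrast, changes the Parse output, which is the input hashed by every later stage.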

In --verbose mode, cache hits and misses are reported per stage, so you can confirm a resume is doing what you expect:

indx ./docs --out ./ai-ready --resume --verbose

The performance contract is that indx streams the estate rather than materialising it. Walk and the downstream stages process files as an iterator, holding the working set — not the whole directory — in memory. Files are read incrementally where the parser allows, and large intermediate buffers are released promptly.

Concretely, this is what keeps a 2 GB folder from needing 2 GB of RAM:

  • Walk yields files lazily; nothing assumes the full file list fits in memory.
  • Parse runs in bounded worker pools, so only --jobs documents are in flight at once.
  • Embed and upsert flow in batches of 64, so vectors are written and dropped rather than accumulated.
  • Vectors in a sealed .indx archive are memory-mapped on demand from vectors.f32 on load, not read whole.
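
The first point, a walk that never builds the full file list, can be sketched with a generator over `os.scandir`. `walk_lazily` is a hypothetical helper for illustration, not indx's Walk stage; it ignores symlinks and applies no include or exclude rules.

```python
import os


def walk_lazily(root):
    """Yield file paths one at a time without building the full file list."""
    pending = [root]  # only directories still to visit are held in memory
    while pending:
        current = pending.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    pending.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path
```

Because the function is a generator, a consumer that pulls one path at a time keeps memory proportional to the number of pending directories rather than the number of files.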

When building a custom stage or your own ingestion loop, follow the stream-then-batch shape rather than loading everything and processing one item at a time:

from itertools import islice


def batched(iterable, size):
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


# Stream the walk, embed and upsert in batches, resume on cache hits.
for batch in batched(ctx.chunks, size=64):
    if cache.has(batch):  # resume: skip completed work
        continue
    vectors = embedder.embed([c.text for c in batch])
    store.upsert(
        ids=[c.id for c in batch],
        vectors=vectors,
        payloads=[c.metadata for c in batch],
    )

The anti-pattern to avoid is the opposite — reading every file up front and embedding one chunk at a time:

# Anti-pattern: materialises the whole estate, one round-trip per chunk
texts = [p.read_text() for p in all_files]
vectors = [embedder.embed([t]) for t in texts]
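
A quick count shows why the difference matters. The figure of 10,000 chunks is purely illustrative, and `round_trips` is a hypothetical helper.

```python
import math


def round_trips(num_chunks, batch_size):
    """Number of embed or upsert calls needed to cover num_chunks."""
    return math.ceil(num_chunks / batch_size)


print(round_trips(10_000, 1))   # 10000 calls, one per chunk
print(round_trips(10_000, 64))  # 157 calls at the default batch size
```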

The shipped zero-config defaults are cloud-backed, so model-heavy work runs on managed APIs out of the box. The opt-in local profile (docling + ollama:qwen2.5 + bge-m3) still supports the air-gapped path with zero network calls, but local models can be the slow part of a large run on commodity hardware. When you are running the local profile, the mitigations below apply, in order of impact:

  1. Parallelise parsing. Raise --jobs to match your cores; parsing is usually the first bottleneck on document-heavy corpora.

  2. Lean on batching. Keep embedding batched (default 64); increase the batch size if you have GPU/CPU headroom.

  3. Skip what you do not need. Use --no-embed to produce a graph-only space (Walk → Relate, no vectors) when you only need structure, and skip enrichment by dropping the Enrich stage when you do not want LLM work:

    # Graph only: no embedding, fastest path to structure
    indx ./docs --out ./ai-ready --no-embed

    from indx import DirectoryPipeline
    # Drop enrichment to skip all LLM calls
    space = DirectoryPipeline().drop("enrich").run("./docs", "./out")

  4. Lean back on hosted models for the heaviest stages when policy allows (the same cloud backends used by the zero-config defaults). A hosted LLM for Enrich or a hosted embedder benefits from asyncio bounded concurrency and can dramatically cut wall-clock time on large estates. See Enrichment with LLM/VLM and Choosing an embedder.

  5. Resume aggressively. Pass --resume on iterative runs so unchanged work is served from the cache and you only pay for what changed.

| Goal | Lever |
| --- | --- |
| Parse faster | --jobs / -j (parse workers) |
| Embed faster | embed batch size (default 64), --jobs (embed concurrency) |
| Hide cloud latency | Enrich/embed bounded concurrency (asyncio) |
| Cheap re-runs | --resume (+ --verbose to see hits/misses) |
| Skip vectors | --no-embed (graph-only space) |
| Skip LLM work | drop the enrich stage |
| Stay memory-stable | stream + batch; let the defaults do their job |

For the full flag list see the CLI reference (--jobs, --resume, --no-embed); for keeping runs auditable and byte-stable see Reproducibility.