Reproducibility & Determinism
A knowledge space is only trustworthy if you can rebuild it and get the same thing back. indx is engineered so that the same inputs, the same configuration, and the same component versions produce a byte-stable index.json — and so that the parts that can’t be made bit-reproducible are at least fully recorded for audit. This page explains the guarantees, where they hold, and where they have edges.
Stable ids
Every Document and Chunk gets a zero-padded, stable string id, such as doc_0007 or chunk_0481. These ids are not random, and they are not assigned in the order work happens to finish. They are assigned in a deterministic traversal order:
- Folder lineage: folder ancestry, root to leaf.
- Path: the file path within a folder.
- In-document index: the 0-based Chunk.index position within its parent document.
Because the ordering key is derived entirely from the directory structure and document position (never from wall-clock time, hash-map iteration order, or scheduler timing), a rerun over unchanged input yields identical ids for the same content. Ids are the anchors that downstream relations, neighbor links, and chunk_ids lists point at, so stable ids keep the whole graph diff-friendly across builds.
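The ordering described above can be sketched in a few lines. This is a minimal illustration, not indx's implementation: `ChunkRef`, `ordering_key`, and `assign_chunk_ids` are hypothetical names, and starting the counter at 0 with a width of 4 is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical stand-in for a chunk: where it lives and its position."""

    folder_lineage: tuple[str, ...]  # folder ancestry, root to leaf
    path: str                        # file path within the folder
    index: int                       # 0-based position within the document


def ordering_key(chunk: ChunkRef) -> tuple:
    # Lineage first, then path, then in-document index.
    return (chunk.folder_lineage, chunk.path, chunk.index)


def assign_chunk_ids(chunks: list[ChunkRef], width: int = 4) -> dict[ChunkRef, str]:
    # The key is built only from structure, so the result does not depend
    # on the order in which the chunks were discovered.
    ordered = sorted(chunks, key=ordering_key)
    # Assumption: numbering starts at 0 and is zero-padded to `width` digits.
    return {chunk: f"chunk_{n:0{width}d}" for n, chunk in enumerate(ordered)}
```

Feeding the same chunks in any order yields the same mapping, which is the property the rest of the graph relies on.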
```text
policies/                            policies/
  data/retention.pdf → doc_0007        data/retention.pdf → doc_0007   (same id every run)
    chunk 0 → chunk_0480
    chunk 1 → chunk_0481
    chunk 2 → chunk_0482
```

Concurrency never affects ordering
indx parallelizes the expensive stages: Parse runs across files in a worker pool, Embed runs in batches, and Enrich issues bounded-concurrency LLM calls. Parallel execution means results arrive out of order. To keep that from leaking into output, parallel results are re-sorted into the canonical deterministic order before ids are assigned. The number of --jobs, the batch size, and which worker happens to finish first have no effect on the resulting ids or on the order of elements in index.json.
This is a deliberate design rule, not an accident of implementation: performance tuning must never trade away reproducibility. (See Performance for the concurrency knobs themselves.)
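The sort-before-assign rule can be illustrated with the standard library alone. The sketch below is an assumption-laden stand-in, not the indx worker pool: `parse` and `parse_all` are hypothetical, and the random sleep only exists to make completion order unpredictable.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse(path: str) -> tuple[str, int]:
    # Hypothetical stand-in for an expensive parse step.
    time.sleep(random.uniform(0, 0.01))
    return path, len(path)


def parse_all(paths: list[str], jobs: int) -> list[tuple[str, str]]:
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = [pool.submit(parse, p) for p in paths]
        # as_completed yields in finish order, which varies from run to run.
        results = [f.result() for f in as_completed(futures)]
    # Re-sort into canonical order before any id is handed out.
    results.sort(key=lambda item: item[0])
    return [(f"doc_{n:04d}", path) for n, (path, _) in enumerate(results)]
```

With this shape, `parse_all(paths, jobs=1)` and `parse_all(paths, jobs=8)` return the same list, because the worker count only influences the order that is discarded by the sort.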
LLM enrichment determinism
Stage 05 (Enrich) uses an LLM to derive document type, topics, tags, and summaries. The LLM.complete(...) protocol defaults to temperature=0.0, which makes greedy decoding the default and removes sampling randomness.
That gets you as close to deterministic as the provider allows — but some providers are not bit-reproducible even at temperature 0 (batching, kernel nondeterminism, model-version rollovers, and floating-point reduction order can all shift a token). indx does not pretend otherwise. Instead, it makes enrichment auditable: the resolved configuration snapshot — including the model name and version — is recorded in two places:
- index.json under metadata (alongside tool_version, created_at, and the embedder block).
- The archive’s manifest.json inside the .indx container.
So even when the exact summary text can vary slightly between runs, you always know precisely which model and configuration produced it.
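The two ideas in this section, a temperature default of 0.0 and a recorded model identity, can be sketched together. Only the `temperature=0.0` default comes from the text; the rest of the `complete` signature, the `name` and `version` attributes, `FakeLLM`, and `enrich` are hypothetical and exist only for illustration.

```python
from typing import Protocol


class LLM(Protocol):
    # Hypothetical shape: only the temperature default is taken from the docs.
    name: str
    version: str

    def complete(self, prompt: str, *, temperature: float = 0.0) -> str: ...


class FakeLLM:
    """Offline stand-in that records the temperature of each call."""

    name = "fake-model"
    version = "1"

    def __init__(self) -> None:
        self.temperatures: list[float] = []

    def complete(self, prompt: str, *, temperature: float = 0.0) -> str:
        self.temperatures.append(temperature)
        return prompt.upper()


def enrich(llm: LLM, text: str) -> dict:
    # No temperature is passed, so the default of 0.0 applies.
    summary = llm.complete(f"Summarize: {text}")
    # Record which model produced the text so the result stays auditable
    # even if a rerun yields slightly different wording.
    return {"summary": summary, "llm": {"name": llm.name, "version": llm.version}}
```

Keeping the model identity next to the generated text is what lets a later reader explain a changed summary without rerunning anything.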
```json
{
  "metadata": {
    "tool_version": "indx 0.4.2",
    "created_at": "2026-06-06T12:00:00Z",
    "embedder": { "name": "bge-m3", "dim": 1024 },
    "config": { "...": "snapshot of resolved indx.toml" }
  }
}
```

Embeddings: pinned and self-describing
Vectors are written to embeddings/vectors.f32 as a contiguous little-endian float32 matrix of shape count × dim. The embedder name and dim are pinned in both the embeddings sub-manifest and the archive manifest.json:
```json
{
  "embedder": { "name": "bge-m3", "dim": 1024 },
  "store": "qdrant"
}
```

Pinning the model identity and dimensionality makes the archive self-describing: a consumer (or space.search(...)) can validate at load time that a query embedder matches the one that produced the stored vectors, and refuse a dimension mismatch before it silently returns garbage. As with LLMs, exact embedding values may differ across hardware or model versions, so the manifest records which model produced the vectors; re-embedding is the correct response to a changed default model.
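As a rough illustration of what a consumer-side check might look like, the sketch below decodes a little-endian float32 payload and compares a query embedder against the pinned block. `load_vectors` and `check_embedder` are hypothetical helpers, not part of the indx API, and the manifest dict simply mirrors the JSON shape shown above.

```python
import struct


def load_vectors(raw: bytes, dim: int) -> list[tuple[float, ...]]:
    row_bytes = 4 * dim  # float32 is 4 bytes per component
    if dim <= 0 or len(raw) % row_bytes != 0:
        raise ValueError(f"{len(raw)} bytes is not a whole number of {dim}-dim rows")
    count = len(raw) // row_bytes
    # "<" forces little-endian decoding regardless of the host byte order.
    flat = struct.unpack(f"<{count * dim}f", raw)
    return [flat[i * dim:(i + 1) * dim] for i in range(count)]


def check_embedder(manifest: dict, query_name: str, query_dim: int) -> None:
    pinned = manifest["embedder"]
    if (pinned["name"], pinned["dim"]) != (query_name, query_dim):
        # Refuse early instead of searching with incompatible vectors.
        raise ValueError(
            f"archive was built with {pinned['name']} (dim {pinned['dim']}), "
            f"query uses {query_name} (dim {query_dim})"
        )
```

Failing at load time is the point of the design: a mismatched embedder produces plausible-looking but meaningless similarity scores, which is much harder to notice than an exception.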
What is and isn’t byte-stable
| Output | Reproducible? | Notes |
|---|---|---|
| Document / chunk ids | Yes | Deterministic traversal order; unaffected by concurrency. |
| Element ordering in index.json | Yes | Parallel results re-sorted before serialization. |
| index.json bytes | Yes, modulo created_at | Same inputs + config + versions ⇒ byte-identical. |
| Neighbor links, chunk_ids, relation targets | Yes | They reference the stable ids. |
| LLM enrichment text (topics, summary, tags) | Best-effort | temperature=0.0; some providers not bit-reproducible. Resolved model recorded. |
| Embedding vectors | Best-effort | float32 matrix; embedder name + dim pinned for validation. |
| created_at timestamp | No (by design) | UTC ISO-8601; the one deliberately varying field. |
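When comparing two builds by hand, the "modulo created_at" row suggests dropping that one field before comparing. A minimal sketch, assuming the metadata layout shown earlier on this page; `equal_modulo_created_at` is a hypothetical helper, not an indx command:

```python
import json


def equal_modulo_created_at(index_a: str, index_b: str) -> bool:
    """Compare two index.json payloads, ignoring metadata.created_at."""

    def normalized(text: str) -> dict:
        doc = json.loads(text)
        # Drop only the one field that is expected to differ between runs.
        doc.get("metadata", {}).pop("created_at", None)
        return doc

    return normalized(index_a) == normalized(index_b)
```

This compares parsed JSON, so it also ignores whitespace and key order. It is a convenience check for humans, weaker than the byte-level equality that the guarantee itself describes.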
Pin your versions
The guarantee is “same inputs + same config + same component versions.” Determinism is scoped to a fixed environment:
- Pin the indx version and the versions of any parser/LLM/embedder/store extras you use.
- Keep indx.toml under version control; the resolved snapshot is recorded in output so you can diff it.
- Changing a component invalidates only the affected work on a --resume build (changing the parser invalidates Parse and everything downstream; changing the embedder invalidates only Embed), which keeps reruns cheap without compromising correctness. See Configuration and Reproducibility-adjacent caching in Performance.
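The invalidation rule in the last bullet amounts to a walk over stage dependencies. The sketch below is purely illustrative: the `DEPENDS_ON` map is a hypothetical simplification that uses only the stage names mentioned on this page, and it is not how indx actually tracks cached work.

```python
# Hypothetical dependency map: each stage lists the stages it reads from.
# Assumption: Embed and Enrich both consume Parse output and nothing
# consumes Embed output.
DEPENDS_ON = {
    "parse": [],
    "embed": ["parse"],
    "enrich": ["parse"],
}


def invalidated_by(changed: str) -> set[str]:
    """Return the changed stage plus every stage downstream of it."""
    stale = {changed}
    grew = True
    while grew:
        grew = False
        for stage, inputs in DEPENDS_ON.items():
            if stage not in stale and any(i in stale for i in inputs):
                stale.add(stage)
                grew = True
    return stale
```

Under these assumptions, `invalidated_by("parse")` covers all three stages while `invalidated_by("embed")` covers only Embed, matching the two cases described in the bullet.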
How indx tests this
Determinism is not a hope; it is a guardrail enforced in CI with byte-stable golden-file tests. The suite runs the pipeline (with a fixed seed and offline, mocked LLM/embedder backends) over a committed sample directory and asserts the produced index.json is byte-equal to a checked-in golden file:
```python
from pathlib import Path

# DirectoryPipeline is assumed to be imported from the indx package.


def test_index_json_is_byte_stable(tmp_path):
    # Build the committed sample into a temporary directory with a fixed seed.
    space = DirectoryPipeline(seed=0).run("tests/data/sample", tmp_path)
    # Compare raw bytes so encoding or newline differences are not masked.
    produced = (tmp_path / "index.json").read_bytes()
    golden = Path("tests/golden/sample.index.json").read_bytes()
    assert produced == golden
```

Any unintended change to ids, ordering, or serialization fails the diff loudly. Golden files are regenerated deliberately (and gated on a schema_version bump when the serialized shape legitimately changes), never blindly. See Testing for the full approach, including the mocked backends and deterministic seeding that make this stable offline.
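When a golden comparison fails, a plain `assert produced == golden` on a large file is hard to read. One possible way to make the failure easier to act on is to render a unified diff with the standard library; `explain_mismatch` is a hypothetical helper for illustration, not part of the indx test suite.

```python
import difflib


def explain_mismatch(produced: str, golden: str, context: int = 2) -> str:
    """Return a unified diff of golden vs. produced, or '' if they match."""
    diff = difflib.unified_diff(
        golden.splitlines(keepends=True),
        produced.splitlines(keepends=True),
        fromfile="golden",
        tofile="produced",
        n=context,
    )
    return "".join(diff)
```

A test could pass the result as the assertion message, so a shifted id or reordered element shows up as a few marked lines instead of two opaque blobs.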
Related reading
- The .indx archive: container layout, manifest.json, checksums, and sealing/loading.
- index.json reference: the serialized graph schema and metadata block.
- Testing: golden-file determinism tests and deterministic seeds.
- Local & air-gapped: the offline default stack that makes deterministic, no-egress builds possible.