Reproducibility & Determinism

A knowledge space is only trustworthy if you can rebuild it and get the same result back. indx is engineered so that the same inputs, the same configuration, and the same component versions produce a byte-stable index.json (apart from the created_at timestamp), and so that the parts that can’t be made bit-reproducible are at least fully recorded for audit. This page explains the guarantees, where they hold, and where their limits are.

Stable ids

Every Document and Chunk gets a zero-padded, stable string id, such as doc_0007 or chunk_0481. These ids are not random, and they are not assigned in the order work happens to finish. They are assigned in a deterministic traversal order, sorting on three keys in this order of precedence:

Folder lineage — folder ancestry, root to leaf.
Path — the file path within a folder.
In-document index — the 0-based Chunk.index position within its parent document.

Because the ordering key is derived entirely from the directory structure and document position (never from wall-clock time, hash-map iteration order, or scheduler timing), a rerun over unchanged input yields identical ids for the same content. Ids are the anchors that downstream relations, neighbor links, and chunk_ids lists point at, so stable ids keep the whole graph diff-friendly across builds.

policies/                        policies/
  data/retention.pdf  → doc_0007   data/retention.pdf  → doc_0007   (same id every run)
    chunk 0          → chunk_0480
    chunk 1          → chunk_0481
    chunk 2          → chunk_0482
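The ordering rule above can be sketched as a sort followed by sequential numbering. This is a minimal illustration, not indx's actual implementation: the `ChunkRef` type, the `sort_key` and `assign_ids` helpers, the choice to start numbering at 0, and the four-digit padding (taken from the example ids on this page) are all assumptions.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical record: a chunk located by file path and in-document index."""
    path: str   # e.g. "policies/data/retention.pdf"
    index: int  # 0-based position within the parent document


def sort_key(ref: ChunkRef):
    p = PurePosixPath(ref.path)
    # Folder lineage (root to leaf), then file name, then in-document index.
    return (p.parent.parts, p.name, ref.index)


def assign_ids(refs):
    """Return {ChunkRef: id}, numbering chunks in canonical sorted order."""
    ordered = sorted(refs, key=sort_key)
    return {ref: f"chunk_{n:04d}" for n, ref in enumerate(ordered)}


refs = [
    ChunkRef("policies/data/retention.pdf", 1),
    ChunkRef("intro.md", 0),
    ChunkRef("policies/data/retention.pdf", 0),
]
ids_a = assign_ids(refs)
ids_b = assign_ids(list(reversed(refs)))  # same chunks, different arrival order
assert ids_a == ids_b                     # ids depend only on the sort key
```

Because nothing in the key depends on when a chunk was produced, shuffling the input list cannot change the mapping.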

Concurrency never affects ordering

indx parallelizes the expensive stages: Parse runs across files in a worker pool, Embed runs in batches, and Enrich issues bounded-concurrency LLM calls. Parallel execution means results arrive out of order. To keep that from leaking into output, parallel results are re-sorted into the canonical deterministic order before ids are assigned. The --jobs value, the batch size, and which worker happens to finish first have no effect on the resulting ids or on the order of elements in index.json.

This is a deliberate design rule, not an accident of implementation: performance tuning must never trade away reproducibility. (See Performance for the concurrency knobs themselves.)
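The re-sort step can be illustrated with the standard library alone. The sketch below is not indx code: `parse` and `build` are hypothetical stand-ins, and the `jobs` parameter only mimics the role of `--jobs`. It shows that collecting results in completion order and then sorting before numbering gives the same ids for any worker count.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse(path: str) -> tuple[str, int]:
    """Stand-in for an expensive parse; returns (path, fake payload size)."""
    time.sleep(random.uniform(0, 0.01))  # simulate uneven completion times
    return path, len(path)


def build(paths: list[str], jobs: int) -> list[tuple[str, str]]:
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = [pool.submit(parse, p) for p in paths]
        # Completion order depends on scheduling and is not reproducible.
        results = [f.result() for f in as_completed(futures)]
    # Re-sort into canonical order *before* numbering, so ids ignore timing.
    results.sort(key=lambda r: r[0])
    return [(f"doc_{n:04d}", path) for n, (path, _) in enumerate(results)]


paths = ["b/two.md", "a/one.md", "c/three.md", "a/zero.md"]
assert build(paths, jobs=1) == build(paths, jobs=4)
```

If the numbering happened inside the `as_completed` loop instead, the ids would follow whichever worker finished first.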

LLM enrichment determinism

Stage 05 (Enrich) uses an LLM to derive document type, topics, tags, and summaries. The LLM.complete(...) protocol defaults to temperature=0.0, which makes greedy decoding the default and removes sampling randomness.

That gets you as close to deterministic as the provider allows — but some providers are not bit-reproducible even at temperature 0 (batching, kernel nondeterminism, model-version rollovers, and floating-point reduction order can all shift a token). indx does not pretend otherwise. Instead, it makes enrichment auditable: the resolved configuration snapshot — including the model name and version — is recorded in two places:

index.json under metadata (alongside tool_version, created_at, and the embedder block).
The archive’s manifest.json inside the .indx container.

So even when the exact summary text can vary slightly between runs, you always know precisely which model and configuration produced it.

{
  "metadata": {
    "tool_version": "indx 0.4.2",
    "created_at": "2026-06-06T12:00:00Z",
    "embedder": { "name": "bge-m3", "dim": 1024 },
    "config": { "...": "snapshot of resolved indx.toml" }
  }
}
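One way to use the recorded snapshot is to diff the metadata blocks of two builds and see which settings changed. The helper names below (`flatten`, `metadata_drift`) are hypothetical, and the second metadata block uses a made-up `dim` value purely to show a difference; only the field names come from the example above.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into {"a.b.c": value} pairs."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def metadata_drift(meta_a, meta_b, ignore=("created_at",)):
    """Return {path: (a, b)} for every metadata field that differs."""
    a, b = flatten(meta_a), flatten(meta_b)
    return {
        k: (a.get(k), b.get(k))
        for k in sorted(a.keys() | b.keys())
        if k not in ignore and a.get(k) != b.get(k)
    }


old = {"tool_version": "indx 0.4.2", "created_at": "2026-06-06T12:00:00Z",
       "embedder": {"name": "bge-m3", "dim": 1024}}
new = {"tool_version": "indx 0.4.2", "created_at": "2026-06-07T09:30:00Z",
       "embedder": {"name": "bge-m3", "dim": 768}}  # illustrative change only
drift = metadata_drift(old, new)
assert drift == {"embedder.dim": (1024, 768)}
```

An empty result means the two builds ran with the same recorded configuration, so any remaining difference in enrichment text comes from the provider rather than from a settings change.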

Embeddings: pinned and self-describing

Vectors are written to embeddings/vectors.f32 as a contiguous little-endian float32 matrix of shape count × dim. The embedder name and dim are pinned in both the embeddings sub-manifest and the archive manifest.json:

{
  "embedder": { "name": "bge-m3", "dim": 1024 },
  "store": "qdrant"
}

Pinning the model identity and dimensionality makes the archive self-describing: a consumer (or space.search(...)) can validate at load time that a query embedder matches the one that produced the stored vectors, and refuse a dimension mismatch before it silently returns garbage. As with LLMs, exact embedding values may differ across hardware or model versions, so the manifest records which model produced the vectors — re-embedding is the correct response to a changed default model.
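A consumer-side check along these lines might look like the sketch below. It is not the indx loader: `load_vectors` is a hypothetical helper, and the row-major layout (one vector after another) is an assumption, since the page only states the shape and byte order. The demo uses a tiny 2 × 3 matrix rather than a real archive.

```python
import struct
import tempfile
from pathlib import Path


def load_vectors(path, manifest, query_embedder):
    """Read a little-endian float32 matrix after validating the embedder."""
    pinned = manifest["embedder"]
    if (pinned["name"], pinned["dim"]) != (query_embedder["name"], query_embedder["dim"]):
        raise ValueError(f"embedder mismatch: archive={pinned}, query={query_embedder}")
    dim = pinned["dim"]
    raw = Path(path).read_bytes()
    if len(raw) % (4 * dim) != 0:
        raise ValueError("file size is not a multiple of dim * 4 bytes")
    count = len(raw) // (4 * dim)
    # "<" forces little-endian decoding regardless of host byte order.
    flat = struct.unpack(f"<{count * dim}f", raw)
    # Assumption: rows are stored one vector after another (row-major).
    return [flat[i * dim:(i + 1) * dim] for i in range(count)]


# Round-trip demo with a tiny 2 x 3 matrix.
manifest = {"embedder": {"name": "bge-m3", "dim": 3}}
with tempfile.TemporaryDirectory() as tmp:
    f = Path(tmp) / "vectors.f32"
    f.write_bytes(struct.pack("<6f", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
    rows = load_vectors(f, manifest, {"name": "bge-m3", "dim": 3})
assert rows == [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
```

The mismatch check runs before any bytes are read, so a wrong query embedder fails fast instead of producing meaningless similarity scores.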

What is and isn’t byte-stable

Output	Reproducible?	Notes
Document / chunk ids	Yes	Deterministic traversal order; unaffected by concurrency.
Element ordering in `index.json`	Yes	Parallel results re-sorted before serialization.
`index.json` bytes	Yes, modulo `created_at`	Same inputs + config + versions ⇒ byte-identical.
Neighbor links, `chunk_ids`, relation targets	Yes	They reference the stable ids.
LLM enrichment text (topics, summary, tags)	Best-effort	`temperature=0.0`; some providers not bit-reproducible. Resolved model recorded.
Embedding vectors	Best-effort	`float32` matrix; embedder name + dim pinned for validation.
`created_at` timestamp	No (by design)	UTC ISO-8601; the one deliberately varying field.
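To compare two builds "modulo created_at" in practice, one option is to drop that single field and compare what remains. The sketch below assumes `created_at` lives under `metadata` as in the example on this page; `canonical_bytes` and `same_build` are hypothetical helpers, and the `documents` field in the demo payload is made up. Note that re-serializing ignores whitespace differences, so this is weaker than a raw byte comparison; key order is still checked because it is preserved on the round trip.

```python
import json


def canonical_bytes(index_json: bytes) -> bytes:
    """Re-serialize index.json with metadata.created_at removed."""
    doc = json.loads(index_json)
    doc.get("metadata", {}).pop("created_at", None)
    # No sort_keys: key order is preserved, so ordering drift still shows up.
    return json.dumps(doc, separators=(",", ":")).encode("utf-8")


def same_build(a: bytes, b: bytes) -> bool:
    return canonical_bytes(a) == canonical_bytes(b)


run1 = (b'{"metadata": {"tool_version": "indx 0.4.2", '
        b'"created_at": "2026-06-06T12:00:00Z"}, "documents": [{"id": "doc_0007"}]}')
run2 = run1.replace(b"2026-06-06T12:00:00Z", b"2026-06-07T08:15:00Z")
run3 = run1.replace(b"doc_0007", b"doc_0008")

assert same_build(run1, run2)      # only the timestamp differs
assert not same_build(run1, run3)  # an id change is still detected
```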

Pin your versions

The guarantee is “same inputs + same config + same component versions.” Determinism is scoped to a fixed environment:

Pin the indx version and the versions of any parser/LLM/embedder/store extras you use.
Keep indx.toml under version control; the resolved snapshot is recorded in output so you can diff it.
Changing a component invalidates only the affected work on a --resume build: changing the parser invalidates Parse and everything downstream, while changing the embedder invalidates only Embed. This keeps reruns cheap without compromising correctness. See Configuration, and "Reproducibility-adjacent caching" in Performance.
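Selective invalidation of this kind is commonly built from cache keys that hash a stage's own component version together with the keys of the stages it consumes. The sketch below shows that general technique, not indx's cache: the dependency graph, the lowercase stage names, and the version strings are all hypothetical, chosen so that the parser feeds both other stages and nothing depends on the embedder.

```python
import hashlib

# Hypothetical dependency graph: stage -> upstream stages it consumes.
DEPENDS_ON = {
    "parse": [],
    "enrich": ["parse"],
    "embed": ["parse"],
}


def stage_key(stage, versions, _memo=None):
    """Hash a stage's component version together with its upstream keys."""
    memo = {} if _memo is None else _memo
    if stage not in memo:
        h = hashlib.sha256(f"{stage}={versions[stage]}".encode())
        for dep in sorted(DEPENDS_ON[stage]):
            h.update(stage_key(dep, versions, memo).encode())
        memo[stage] = h.hexdigest()
    return memo[stage]


def invalidated(old_versions, new_versions):
    """Stages whose cache key changed between two configurations."""
    return sorted(
        s for s in DEPENDS_ON
        if stage_key(s, old_versions) != stage_key(s, new_versions)
    )


base = {"parse": "parser 1.0", "enrich": "llm-a", "embed": "embedder-a"}
assert invalidated(base, {**base, "embed": "embedder-b"}) == ["embed"]
assert invalidated(base, {**base, "parse": "parser 2.0"}) == ["embed", "enrich", "parse"]
```

A change propagates only along dependency edges, which is why swapping the embedder leaves the parse and enrich keys untouched in this model.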

How indx tests this

Determinism is not a hope — it is a guardrail enforced in CI with byte-stable golden-file tests. The suite runs the pipeline (with a fixed seed and offline, mocked LLM/embedder backends) over a committed sample directory and asserts the produced index.json is byte-equal to a checked-in golden file:

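from pathlib import Path

def test_index_json_is_byte_stable(tmp_path):
    # DirectoryPipeline is provided by indx (import omitted); backends are mocked offline.
    DirectoryPipeline(seed=0).run("tests/data/sample", tmp_path)
    # Compare raw bytes: read_text() would translate line endings and mask such drift.
    produced = (tmp_path / "index.json").read_bytes()
    golden = Path("tests/golden/sample.index.json").read_bytes()
    assert produced == golden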

Any unintended change to ids, ordering, or serialization fails the diff loudly. Golden files are regenerated deliberately (and gated on a schema_version bump when the serialized shape legitimately changes), never blindly. See Testing for the full approach, including the mocked backends and deterministic seeding that make this stable offline.

The .indx archive — container layout, manifest.json, checksums, and sealing/loading.
index.json reference — the serialized graph schema and metadata block.
Testing — golden-file determinism tests and deterministic seeds.
Local & air-gapped — the offline default stack that makes deterministic, no-egress builds possible.