The .indx Archive Format
A .indx file is the single portable artifact for a knowledge space — a self-describing, inspectable container that travels with everything a downstream tool needs: the knowledge graph, agent-readable chunks, and the vector matrix. It is an ordinary ZIP (deflate) file with a defined internal layout and a manifest, so you can open it with unzip as readily as with KnowledgeSpace.load.
At a glance
| Property | Value |
|---|---|
| Container | ZIP, deflate compression |
| Access | Random access to individual members (no full decompress) |
| Default base name | handbook → handbook.indx (set with --name) |
| Produced by | The default .indx OutputWriter (stage 06), or space.save(path) |
| Opened by | KnowledgeSpace.load(path), indx inspect, indx query |
| Integrity | Per-member SHA-256 checksums in manifest.json |
| Versioning | indx_version (semver, major-gated); tool_version (diagnostic) |
Internal layout
A sealed archive contains a manifest, the serialized graph, one JSON file per chunk, and the embedding matrix with its own sub-manifest:

    handbook.indx (ZIP container)
    ├── manifest.json          # archive metadata + integrity
    ├── index.json             # the knowledge graph (see the index.json reference)
    ├── chunks/                # agent-readable chunks + context
    │   ├── chunk_0000.json
    │   ├── chunk_0001.json
    │   └── …
    └── embeddings/
        ├── manifest.json      # model, dim, count, backend
        └── vectors.f32        # contiguous little-endian float32 matrix (count × dim)

A few details that make the format predictable:
- `index.json` is the full serialized knowledge graph (documents, chunks, relations, stats). It is the authoritative graph; its schema is documented in the `index.json` reference.
- `chunks/chunk_NNNN.json` holds one file per chunk, named by the zero-padded stable chunk id (`chunk_0000`, `chunk_0001`, …). Each uses the canonical chunk shape and may additionally carry a resolved context window for agent consumption. Storing chunks as individual members is what makes random access cheap: a consumer can pull one chunk without reading the whole graph.
- `embeddings/vectors.f32` is a raw, contiguous matrix of little-endian `float32` values laid out as `count × dim` (row-major: each chunk’s full vector, one after another). There are no delimiters or headers inside the blob; its shape comes from the embeddings manifest, which lets the matrix be memory-mapped directly.
- `embeddings/manifest.json` records the embedder `name`, vector `dim`, vector `count`, and the store backend, so a reader can validate and reshape `vectors.f32` without guessing.
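The layout above can be exercised with nothing but the standard library. The following is a minimal sketch, not indx's own reader: it assumes the embeddings manifest stores its shape under the keys `dim` and `count`, and the helper names (`read_chunk`, `read_vector`, `demo_archive`) and the toy chunk contents are hypothetical.

```python
import io
import json
import struct
import zipfile


def read_chunk(zf: zipfile.ZipFile, chunk_id: int) -> dict:
    """Read one chunk member by its zero-padded id, leaving other members untouched."""
    with zf.open(f"chunks/chunk_{chunk_id:04d}.json") as fh:
        return json.load(fh)


def read_vector(zf: zipfile.ZipFile, row: int) -> tuple:
    """Return one row of the count x dim float32 matrix."""
    # Assumption: the embeddings manifest exposes its shape as "dim" and "count".
    meta = json.loads(zf.read("embeddings/manifest.json"))
    dim, count = meta["dim"], meta["count"]
    if not 0 <= row < count:
        raise IndexError(f"row {row} out of range for {count} vectors")
    # For simplicity this sketch reads the whole blob; a real reader might map it.
    blob = zf.read("embeddings/vectors.f32")
    if len(blob) != count * dim * 4:
        raise ValueError("vectors.f32 size does not match count * dim * 4 bytes")
    # Row-major layout: row r starts at byte offset r * dim * 4.
    return struct.unpack_from(f"<{dim}f", blob, row * dim * 4)


def demo_archive() -> bytes:
    """Build a tiny in-memory archive with two chunks and a 2 x 2 matrix (toy data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("chunks/chunk_0000.json", json.dumps({"id": "chunk_0000"}))
        zf.writestr("chunks/chunk_0001.json", json.dumps({"id": "chunk_0001"}))
        zf.writestr("embeddings/manifest.json", json.dumps({"dim": 2, "count": 2}))
        zf.writestr("embeddings/vectors.f32", struct.pack("<4f", 1.0, 2.0, 3.0, 4.0))
    return buf.getvalue()


if __name__ == "__main__":
    with zipfile.ZipFile(io.BytesIO(demo_archive())) as zf:
        print(read_chunk(zf, 1))
        print(read_vector(zf, 1))
```

The size check before unpacking is the "validate and reshape without guessing" step: if the blob length disagrees with `count * dim * 4`, the matrix cannot be interpreted safely.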
manifest.json
The top-level manifest.json is the archive’s metadata and integrity record. Read it first to learn what an archive contains and whether you can open it.

    {
      "indx_version": "1.0",
      "tool_version": "indx 0.4.2",
      "created_at": "2026-06-06T12:00:00Z",
      "root": "/abs/path/docs",
      "counts": { "documents": 128, "chunks": 1042, "relations": 380 },
      "embedder": { "name": "bge-m3", "dim": 1024 },
      "store": "qdrant",
      "checksum": {
        "algo": "sha256",
        "index.json": "…",
        "embeddings/vectors.f32": "…"
      }
    }

| Field | Type | Meaning |
|---|---|---|
| `indx_version` | string (semver) | Archive schema version. Gates compatibility on load. |
| `tool_version` | string | The producing build, e.g. `indx 0.4.2`. Diagnostic only. |
| `created_at` | string (ISO-8601 UTC) | Build timestamp. |
| `root` | string | Absolute path of the walked directory/ZIP. |
| `counts` | object | `documents`, `chunks`, `relations` totals. |
| `embedder` | object | `{ name, dim }`: pins the model and vector dimensionality. |
| `store` | string | The vector store backend used (e.g. `qdrant`, `jsonl`). |
| `checksum` | object | `algo` plus a per-member digest map (e.g. `index.json`, `embeddings/vectors.f32`). |
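Because the container is a plain ZIP, the manifest can be read on its own before anything else is touched. A minimal sketch using only the standard library follows; `read_manifest` and `summarize` are hypothetical helper names, not part of indx, and the field names are taken from the table above.

```python
import json
import zipfile


def read_manifest(archive) -> dict:
    """Load only manifest.json from an archive (a path or a binary file-like object)."""
    with zipfile.ZipFile(archive) as zf:
        return json.loads(zf.read("manifest.json"))


def summarize(manifest: dict) -> str:
    """Build a one-line summary from the documented manifest fields."""
    counts = manifest["counts"]
    embedder = manifest["embedder"]
    return (
        f"indx {manifest['indx_version']}: "
        f"{counts['documents']} documents, {counts['chunks']} chunks, "
        f"{embedder['name']} ({embedder['dim']}-d), store={manifest['store']}"
    )
```

A caller might use `summarize(read_manifest("handbook.indx"))` to decide whether an archive is worth loading in full.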
Sealing and loading
Sealing
Sealing happens in stage 06 (Embed+Pack) via the .indx OutputWriter, or explicitly through space.save(path). The writer:
- Writes `index.json`, the serialized knowledge graph.
- Writes one per-chunk file under `chunks/`.
- Writes the vector matrix `embeddings/vectors.f32` and its `embeddings/manifest.json`.
- Writes `manifest.json`, computing a SHA-256 checksum for each member.
- Deflates everything into the `.indx` ZIP container.
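The sealing steps can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the actual OutputWriter: the `seal` function is hypothetical, digests are assumed to be stored as lowercase hex strings, and members are passed in as already-serialized bytes.

```python
import hashlib
import json
import zipfile


def seal(members: dict, manifest: dict, out) -> None:
    """Write members plus a checksummed manifest.json into a deflate ZIP.

    `members` maps archive paths (e.g. "index.json") to raw bytes;
    `out` is a path or a writable binary file-like object.
    """
    # Assumption: digests are recorded as hex strings keyed by member path.
    checksum = {"algo": "sha256"}
    for name, data in members.items():
        checksum[name] = hashlib.sha256(data).hexdigest()
    sealed_manifest = {**manifest, "checksum": checksum}

    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("manifest.json", json.dumps(sealed_manifest, indent=2))
        # Sorted order keeps the member sequence stable between runs.
        for name in sorted(members):
            zf.writestr(name, members[name])
```

Computing the digests over the uncompressed bytes means a reader can verify a member after extracting it with any ZIP tool.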
Loading
KnowledgeSpace.load(archive) reverses the process, with verification gates:
- Verify `indx_version` compatibility (see Versioning below). An incompatible archive fails fast.
- Validate checksums: each member is checked against the SHA-256 digest recorded in `manifest.json`; a mismatch is an archive error.
- Reconstruct the in-memory models: Pydantic v2 models are rebuilt from `index.json` and the chunk files (write and read share one schema, so the data validates on load).
- Memory-map vectors on demand: `vectors.f32` is `mmap`ped rather than read eagerly, so opening a large archive and reading metadata stays cheap; vectors are touched only when a search needs them.
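The checksum gate is the easiest of these steps to reproduce outside indx. The sketch below covers only that gate and assumes hex-encoded digests in the manifest's `checksum` map; `ArchiveError` and `verify_checksums` are hypothetical names, and the real loader may behave differently.

```python
import hashlib
import json
import zipfile


class ArchiveError(Exception):
    """Hypothetical error type for a corrupt or inconsistent archive."""


def verify_checksums(archive) -> list:
    """Check every member listed in manifest.json and return the verified names."""
    with zipfile.ZipFile(archive) as zf:
        checksum = json.loads(zf.read("manifest.json"))["checksum"]
        if checksum.get("algo") != "sha256":
            raise ArchiveError(f"unsupported checksum algo: {checksum.get('algo')!r}")
        verified = []
        for name, expected in checksum.items():
            if name == "algo":
                continue
            digest = hashlib.sha256()
            # Stream in 1 MiB blocks so a large vectors.f32 is never held in memory.
            with zf.open(name) as fh:
                for block in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(block)
            if digest.hexdigest() != expected:
                raise ArchiveError(f"checksum mismatch for {name}")
            verified.append(name)
        return verified
```

Streaming the digest keeps verification cheap in memory even when the embedding matrix is large.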
Versioning
.indx archives are versioned through two fields, and only one of them gates compatibility.
indx_version follows semantic versioning and controls whether an archive can be loaded:
| Situation | Loader behavior |
|---|---|
| Same major version | Accepted. |
| Newer minor (same major) | Accepted — forward-tolerant; unknown keys are ignored. |
| Major mismatch | Fatal load error with a clear message. |
tool_version records the producing build (e.g. indx 0.4.2) purely for diagnostics and auditing. It never affects whether an archive loads.
This mirrors indx’s broader compatibility promise: within a major version, fields are only added, never removed or retyped, and consumers must ignore unknown values rather than fail. For an end-to-end view of what makes a rebuild byte-identical, see reproducibility.
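The gating rule reduces to a comparison of major versions plus a tolerant read of the manifest. The following sketch illustrates the rule rather than the loader's actual code: `SUPPORTED_MAJOR`, `IncompatibleArchive`, `check_indx_version`, and `known_fields` are all hypothetical names.

```python
SUPPORTED_MAJOR = 1  # assumption: this sketch's reader understands major version 1


class IncompatibleArchive(Exception):
    """Hypothetical error raised when an archive cannot be loaded."""


def check_indx_version(indx_version: str, supported_major: int = SUPPORTED_MAJOR) -> None:
    """Accept any minor/patch within the supported major; reject everything else."""
    major_text = indx_version.split(".", 1)[0]
    try:
        major = int(major_text)
    except ValueError:
        raise IncompatibleArchive(f"unparseable indx_version: {indx_version!r}") from None
    if major != supported_major:
        raise IncompatibleArchive(
            f"archive has indx_version {indx_version}, "
            f"but this reader supports major version {supported_major}"
        )


def known_fields(manifest: dict, known: set) -> dict:
    """Keep only the keys this reader understands; unknown keys are ignored, not fatal."""
    return {key: value for key, value in manifest.items() if key in known}
```

Note that `tool_version` never appears in the check, which matches its diagnostic-only role.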
The expanded on-disk layout (unsealed)
Running a build writes the expanded layout alongside the sealed archive, so downstream tools can read either the portable container or the loose files directly, whichever is more convenient.

    indx ./docs --out ./ai-ready

    ai-ready/
    ├── handbook.indx    # the portable archive (sealed)
    ├── index.json       # the knowledge graph
    ├── chunks/          # agent-readable chunks + per-chunk context
    └── embeddings/      # vectors + manifest

The loose index.json, chunks/, and embeddings/ mirror the members inside handbook.indx. The .indx file is what you ship or version-control as a single unit; the expanded form is handy for grepping, diffing, or wiring into a tool that wants plain files.
Why ZIP (and not tar.gz or SQLite)
The container choice is deliberate and serves indx’s “open, no-lock-in artifact” goal:
- Random access to individual members. ZIP lets a consumer read just `manifest.json`, or just one chunk, without decompressing the whole file, something a streamed `tar.gz` cannot do.
- Stdlib and cross-platform. ZIP is handled by Python’s stdlib `zipfile`, so reading and writing archives adds no dependency to the light core.
- Inspectable with ubiquitous tooling. Anyone can `unzip handbook.indx` and read the JSON by hand, which matters for an artifact meant to be a public contract.
SQLite was considered (single-file and queryable) but is opaque to non-SQLite tooling and would couple the artifact to a query engine, so SQLite remains a store option rather than the archive format. The trade-off accepted for ZIP is slightly less compression efficiency for many tiny files than a solid tar.gz stream — a fair price for random access and tooling ubiquity.
See also
- The `index.json` schema: the full structure of the knowledge graph inside the archive.
- Inspect and query an archive: `indx inspect` and `indx query` over a `.indx` file.
- Output formats: `.indx` vs `jsonl`, `langchain`, and `llamaindex` writers.
- Reproducibility: what makes a rebuilt archive byte-identical.