The .indx Archive Format

A .indx file is the single portable artifact for a knowledge space — a self-describing, inspectable container that travels with everything a downstream tool needs: the knowledge graph, agent-readable chunks, and the vector matrix. It is an ordinary ZIP (deflate) file with a defined internal layout and a manifest, so you can open it with unzip as readily as with KnowledgeSpace.load.

At a glance

| Property | Value |
| --- | --- |
| Container | ZIP, deflate compression |
| Access | Random access to individual members (no full decompress) |
| Default base name | `handbook` → `handbook.indx` (set with `--name`) |
| Produced by | The default `.indx` `OutputWriter` (stage 06), or `space.save(path)` |
| Opened by | `KnowledgeSpace.load(path)`, `indx inspect`, `indx query` |
| Integrity | Per-member SHA-256 checksums in `manifest.json` |
| Versioning | `indx_version` (semver, major-gated); `tool_version` (diagnostic) |

Internal layout

A sealed archive contains a manifest, the serialized graph, one JSON file per chunk, and the embedding matrix with its own sub-manifest:

handbook.indx                 (ZIP container)
├── manifest.json             # archive metadata + integrity
├── index.json                # the knowledge graph (see the index.json reference)
├── chunks/                   # agent-readable chunks + context
│   ├── chunk_0000.json
│   ├── chunk_0001.json
│   └── …
└── embeddings/
    ├── manifest.json         # model, dim, count, backend
    └── vectors.f32           # contiguous little-endian float32 matrix (count × dim)

A few details that make the format predictable:

- index.json is the full serialized knowledge graph (documents, chunks, relations, stats). It is the authoritative graph; its schema is documented in the index.json reference.
- chunks/ holds one chunk_NNNN.json file per chunk, named by the zero-padded stable chunk id (chunk_0000, chunk_0001, …). Each file uses the canonical chunk shape and may additionally carry a resolved context window for agent consumption. Storing chunks as individual members is what makes random access cheap: a consumer can pull one chunk without reading the whole graph.
- embeddings/vectors.f32 is a raw, contiguous matrix of little-endian float32 values laid out as count × dim (row-major: each chunk’s full vector, one after another). There are no delimiters or headers inside the blob; its shape comes from the embeddings manifest, which lets the matrix be memory-mapped directly.
- embeddings/manifest.json records the embedder name, vector dim, vector count, and the store backend, so a reader can validate and reshape vectors.f32 without guessing.
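The layout above can be read with nothing but the standard library. The following is a minimal sketch, not the indx implementation: `read_chunk` and `read_vector` are hypothetical helper names, and the `dim` and `count` keys in embeddings/manifest.json are assumed from the description above rather than taken from a published schema.

```python
import json
import struct
import zipfile


def read_chunk(archive, chunk_index):
    """Read one chunk member without decompressing the rest of the archive."""
    member = f"chunks/chunk_{chunk_index:04d}.json"
    with zipfile.ZipFile(archive) as zf:
        return json.loads(zf.read(member))


def read_vector(archive, row):
    """Decode a single row of embeddings/vectors.f32 as a tuple of floats."""
    with zipfile.ZipFile(archive) as zf:
        # Assumed key names: "dim" and "count".
        meta = json.loads(zf.read("embeddings/manifest.json"))
        dim, count = meta["dim"], meta["count"]
        if not 0 <= row < count:
            raise IndexError(f"row {row} out of range for {count} vectors")
        blob = zf.read("embeddings/vectors.f32")
    # Row-major layout: row i starts at byte i * dim * 4 (one float32 = 4 bytes).
    return struct.unpack_from(f"<{dim}f", blob, row * dim * 4)
```

`archive` may be a path or a seekable file-like object. Note that `read_vector` inflates the whole matrix member to fetch one row, which is fine for inspection but not a substitute for the memory-mapped access described under Loading.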

`manifest.json`

The top-level manifest.json is the archive’s metadata and integrity record. Read it first to learn what an archive contains and whether you can open it.

{
  "indx_version": "1.0",
  "tool_version": "indx 0.4.2",
  "created_at": "2026-06-06T12:00:00Z",
  "root": "/abs/path/docs",
  "counts": { "documents": 128, "chunks": 1042, "relations": 380 },
  "embedder": { "name": "bge-m3", "dim": 1024 },
  "store": "qdrant",
  "checksum": { "algo": "sha256", "index.json": "…", "embeddings/vectors.f32": "…" }
}

| Field | Type | Meaning |
| --- | --- | --- |
| `indx_version` | string (semver) | Archive schema version. Gates compatibility on load. |
| `tool_version` | string | The producing build, e.g. `indx 0.4.2`. Diagnostic only. |
| `created_at` | string (ISO-8601 UTC) | Build timestamp. |
| `root` | string | Absolute path of the walked directory/ZIP. |
| `counts` | object | `documents`, `chunks`, `relations` totals. |
| `embedder` | object | `{ name, dim }`: pins the model and vector dimensionality. |
| `store` | string | The vector store backend used (e.g. `qdrant`, `jsonl`). |
| `checksum` | object | `algo` plus a per-member digest map (e.g. `index.json`, `embeddings/vectors.f32`). |
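For a consumer that only needs the manifest, the fields in the table can be loaded into a small typed record. This is a sketch using a plain dataclass; `Manifest` is a hypothetical name and is not the Pydantic model indx itself uses. It drops keys it does not recognize, in line with the forward-tolerance rule described under Versioning.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Manifest:
    indx_version: str
    tool_version: str
    created_at: str
    root: str
    counts: dict
    embedder: dict
    store: str
    checksum: dict

    @classmethod
    def from_json(cls, text):
        raw = json.loads(text)
        # Keep only the documented fields; unknown keys are ignored.
        known = {f.name: raw[f.name] for f in fields(cls) if f.name in raw}
        # A missing documented field surfaces as a TypeError here.
        return cls(**known)

    def digests(self):
        """Member name -> digest, without the 'algo' entry."""
        return {k: v for k, v in self.checksum.items() if k != "algo"}
```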

Sealing and loading

Sealing

Sealing happens in stage 06 (Embed+Pack) via the .indx OutputWriter, or explicitly through space.save(path). The writer:

1. Writes index.json, the serialized knowledge graph.
2. Writes one file per chunk under chunks/.
3. Writes the vector matrix embeddings/vectors.f32 and its embeddings/manifest.json.
4. Writes manifest.json, computing a SHA-256 checksum for each member.
5. Deflates everything into the .indx ZIP container.
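The sealing steps can be approximated in a few lines of standard-library Python. This is an illustrative sketch, not the actual OutputWriter: `seal` is a hypothetical function, and it assumes digests are hex-encoded and computed over each member's uncompressed bytes, which the text above does not specify.

```python
import hashlib
import json
import zipfile


def seal(archive, members, manifest_fields):
    """Write members plus a manifest.json carrying per-member SHA-256 digests.

    members: dict of member name (e.g. "index.json") -> bytes.
    manifest_fields: the other manifest keys (indx_version, counts, ...).
    """
    checksum = {"algo": "sha256"}
    for name, data in members.items():
        # Assumption: hex digest over the uncompressed member bytes.
        checksum[name] = hashlib.sha256(data).hexdigest()
    manifest = dict(manifest_fields, checksum=checksum)

    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in members.items():
            zf.writestr(name, data)
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))
    return manifest
```

For example, `seal("handbook.indx", {"index.json": graph_bytes}, {"indx_version": "1.0"})` would produce a two-member archive, where `graph_bytes` stands in for the serialized graph.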

Loading

KnowledgeSpace.load(archive) reverses the process, with verification gates:

1. Verify indx_version compatibility (see Versioning below). An incompatible archive fails fast.
2. Validate checksums. Each member is checked against the SHA-256 digest recorded in manifest.json; a mismatch is an archive error.
3. Reconstruct the in-memory models. Pydantic v2 models are rebuilt from index.json and the chunk files (write and read share one schema, so the data validates on load).
4. Memory-map vectors on demand. vectors.f32 is memory-mapped rather than read eagerly, so opening a large archive and reading metadata stays cheap; vectors are touched only when a search needs them.
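The checksum gate is easy to reproduce outside of indx, for example in a CI check on a shipped archive. The sketch below assumes hex-encoded digests over uncompressed member bytes and a `checksum` map shaped like the manifest example; `verify_checksums` is a hypothetical helper, not part of the indx API.

```python
import hashlib
import json
import zipfile


def verify_checksums(archive):
    """Return the names of members whose digest does not match manifest.json."""
    mismatched = []
    with zipfile.ZipFile(archive) as zf:
        checksum = json.loads(zf.read("manifest.json"))["checksum"]
        if checksum.get("algo") != "sha256":
            raise ValueError(f"unsupported checksum algo: {checksum.get('algo')!r}")
        for name, expected in checksum.items():
            if name == "algo":
                continue
            digest = hashlib.sha256()
            # Stream the member in 1 MiB blocks so large matrices are not
            # held in memory all at once.
            with zf.open(name) as member:
                for block in iter(lambda: member.read(1 << 20), b""):
                    digest.update(block)
            if digest.hexdigest() != expected:
                mismatched.append(name)
    return mismatched
```

An empty list means every listed member matched; a loader following the rules above would treat any non-empty result as an archive error.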

Versioning

.indx archives are versioned through two fields, and only one of them gates compatibility.

indx_version follows semantic versioning and controls whether an archive can be loaded:

| Situation | Loader behavior |
| --- | --- |
| Same major version | Accepted. |
| Newer minor (same major) | Accepted. Forward-tolerant: unknown keys are ignored. |
| Major mismatch | Fatal load error with a clear message. |

tool_version records the producing build (e.g. indx 0.4.2) purely for diagnostics and auditing. It never affects whether an archive loads.

This mirrors indx’s broader compatibility promise: within a major version, fields are only added, never removed or retyped, and consumers must ignore unknown values rather than fail. For an end-to-end view of what makes a rebuild byte-identical, see reproducibility.
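The major-gated rule reduces to comparing the first component of indx_version. A minimal sketch follows, assuming dot-separated numeric versions such as "1.0"; `SUPPORTED_INDX_VERSION` and `check_compatible` are hypothetical names, and the error type is an assumption since the source only says the failure is fatal.

```python
SUPPORTED_INDX_VERSION = "1.0"  # hypothetical: the schema version this reader targets


def major_of(version):
    """Return the major component of a dot-separated version string."""
    return int(version.split(".")[0])


def check_compatible(archive_version, reader_version=SUPPORTED_INDX_VERSION):
    """Raise if the archive's major version differs from the reader's.

    Any minor version within the same major is accepted; the caller is
    expected to ignore keys it does not know.
    """
    if major_of(archive_version) != major_of(reader_version):
        raise ValueError(
            f"archive indx_version {archive_version} is incompatible with "
            f"reader version {reader_version} (major version mismatch)"
        )
```

Note that tool_version plays no part in the check, matching its diagnostic-only role.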

The expanded on-disk layout (unsealed)

Running a build writes the expanded layout alongside the sealed archive, so downstream tools can read either the portable container or the loose files directly — whichever is more convenient.

indx ./docs --out ./ai-ready

ai-ready/
├── handbook.indx        # the portable archive (sealed)
├── index.json           # the knowledge graph
├── chunks/              # agent-readable chunks + per-chunk context
└── embeddings/          # vectors + manifest

The loose index.json, chunks/, and embeddings/ mirror the members inside handbook.indx. The .indx file is what you ship or version-control as a single unit; the expanded form is handy for grepping, diffing, or wiring into a tool that wants plain files.
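The loose embeddings/vectors.f32 is the simplest place to see memory mapping at work, because a plain file can be mapped as-is, whereas a deflated ZIP member has to be decompressed before its bytes are addressable. The sketch below uses only the standard library; `open_vectors` and `vector_at` are hypothetical helpers, and the `count` and `dim` manifest keys are assumed. How KnowledgeSpace.load maps vectors from a sealed archive is not covered here.

```python
import json
import mmap
import struct
from pathlib import Path


def open_vectors(out_dir):
    """Memory-map embeddings/vectors.f32 and return (mapping, count, dim)."""
    emb = Path(out_dir) / "embeddings"
    meta = json.loads((emb / "manifest.json").read_text())
    count, dim = meta["count"], meta["dim"]  # assumed key names
    with open(emb / "vectors.f32", "rb") as fh:
        # The mapping stays valid after the file object is closed.
        mapping = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    if len(mapping) != count * dim * 4:
        mapping.close()
        raise ValueError("vectors.f32 size does not match count x dim float32 values")
    return mapping, count, dim


def vector_at(mapping, row, dim):
    """Decode one row; only the pages backing that row are actually read."""
    return struct.unpack_from(f"<{dim}f", mapping, row * dim * 4)
```

A caller would typically run `mapping, count, dim = open_vectors("./ai-ready")`, fetch rows with `vector_at(mapping, i, dim)`, and call `mapping.close()` when done.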

Why ZIP (and not tar.gz or SQLite)

The container choice is deliberate and serves indx’s “open, no-lock-in artifact” goal:

- Random access to individual members. ZIP lets a consumer read just manifest.json, or just one chunk, without decompressing the whole file, which a streamed tar.gz cannot do.
- Stdlib and cross-platform. ZIP is handled by Python’s stdlib zipfile, so reading and writing archives adds no dependency to the light core.
- Inspectable with ubiquitous tooling. Anyone can unzip handbook.indx and read the JSON by hand, which matters for an artifact meant to be a public contract.

SQLite was considered (single-file and queryable) but is opaque to non-SQLite tooling and would couple the artifact to a query engine, so SQLite remains a store option rather than the archive format. The trade-off accepted for ZIP is compression efficiency: each member is deflated independently, so an archive of many tiny files compresses somewhat less well than a solid tar.gz stream, which shares one compression context across all files. That is a fair price for random access and tooling ubiquity.
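The size trade-off can be measured directly. The sketch below packs the same set of small JSON files into an in-memory ZIP and an in-memory tar.gz and reports both sizes; the file contents are synthetic placeholders, not real indx chunks, and `archive_sizes` is a hypothetical helper.

```python
import io
import json
import tarfile
import zipfile


def archive_sizes(files):
    """Return (zip_bytes, tar_gz_bytes) for the same mapping of name -> bytes."""
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)  # each member is compressed on its own

    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w:gz") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))  # one gzip stream spans all files
    return len(zip_buf.getvalue()), len(tar_buf.getvalue())


# Synthetic stand-ins for many tiny chunk files.
files = {
    f"chunks/chunk_{i:04d}.json": json.dumps(
        {"id": f"chunk_{i:04d}", "text": "lorem ipsum"}
    ).encode()
    for i in range(500)
}
zip_size, tar_gz_size = archive_sizes(files)
print(f"zip: {zip_size} bytes, tar.gz: {tar_gz_size} bytes")
```

With many small, similar files the tar.gz comes out smaller, but reading any one file from it means decompressing the stream up to that file, which is the cost the ZIP layout avoids.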