The .indx Archive Format

A .indx file is the single portable artifact for a knowledge space — a self-describing, inspectable container that travels with everything a downstream tool needs: the knowledge graph, agent-readable chunks, and the vector matrix. It is an ordinary ZIP (deflate) file with a defined internal layout and a manifest, so you can open it with unzip as readily as with KnowledgeSpace.load.

Property | Value
Container | ZIP, deflate compression
Access | Random access to individual members (no full decompress)
Default base name | handbook, giving handbook.indx (set with --name)
Produced by | The default .indx OutputWriter (stage 06), or space.save(path)
Opened by | KnowledgeSpace.load(path), indx inspect, indx query
Integrity | Per-member SHA-256 checksums in manifest.json
Versioning | indx_version (semver, major-gated); tool_version (diagnostic)

A sealed archive contains a manifest, the serialized graph, one JSON file per chunk, and the embedding matrix with its own sub-manifest:

handbook.indx (ZIP container)
├── manifest.json # archive metadata + integrity
├── index.json # the knowledge graph (see the index.json reference)
├── chunks/ # agent-readable chunks + context
│ ├── chunk_0000.json
│ ├── chunk_0001.json
│ └── …
└── embeddings/
    ├── manifest.json # model, dim, count, backend
    └── vectors.f32 # contiguous little-endian float32 matrix (count × dim)

A few details that make the format predictable:

  • index.json is the full serialized knowledge graph (documents, chunks, relations, stats). It is the authoritative graph; its schema is documented in the index.json reference.
  • chunks/chunk_NNNN.json holds one file per chunk, named by the zero-padded stable chunk id (chunk_0000, chunk_0001, …). Each uses the canonical chunk shape and may additionally carry a resolved context window for agent consumption. Storing chunks as individual members is what makes random access cheap — a consumer can pull one chunk without reading the whole graph.
  • embeddings/vectors.f32 is a raw, contiguous matrix of little-endian float32 values laid out as count × dim (row-major: each chunk’s full vector, one after another). There are no delimiters or headers inside the blob — its shape comes from the embeddings manifest, which lets the matrix be memory-mapped directly.
  • embeddings/manifest.json records the embedder name, vector dim, vector count, and the store backend, so a reader can validate and reshape vectors.f32 without guessing.
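The reshape described above can be sketched with the standard library alone. This is a minimal illustration, not the loader's actual code: the key names `count` and `dim` in embeddings/manifest.json are assumptions based on the fields listed above, and `read_vectors` is a hypothetical helper.

```python
import io
import json
import struct
import zipfile


def read_vectors(archive) -> list[tuple[float, ...]]:
    """Return every row of embeddings/vectors.f32 as a tuple of floats."""
    with zipfile.ZipFile(archive) as zf:
        meta = json.loads(zf.read("embeddings/manifest.json"))
        blob = zf.read("embeddings/vectors.f32")

    count, dim = meta["count"], meta["dim"]  # assumed key names
    expected = count * dim * 4  # 4 bytes per float32
    if len(blob) != expected:
        raise ValueError(f"vectors.f32 is {len(blob)} bytes, expected {expected}")

    # "<" forces little-endian decoding regardless of the host byte order.
    row = struct.Struct(f"<{dim}f")
    return [row.unpack_from(blob, i * row.size) for i in range(count)]


# Build a tiny two-vector archive in memory to exercise the reader.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("embeddings/manifest.json", json.dumps({"count": 2, "dim": 3}))
    zf.writestr("embeddings/vectors.f32", struct.pack("<6f", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

vectors = read_vectors(buf)
print(vectors[1])  # the second chunk's vector
```

Checking the blob length against count × dim × 4 before unpacking catches a manifest that does not match the matrix.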

The top-level manifest.json is the archive’s metadata and integrity record. Read it first to learn what an archive contains and whether you can open it.

{
"indx_version": "1.0",
"tool_version": "indx 0.4.2",
"created_at": "2026-06-06T12:00:00Z",
"root": "/abs/path/docs",
"counts": { "documents": 128, "chunks": 1042, "relations": 380 },
"embedder": { "name": "bge-m3", "dim": 1024 },
"store": "qdrant",
"checksum": { "algo": "sha256", "index.json": "", "embeddings/vectors.f32": "" }
}

Field | Type | Meaning
indx_version | string (semver) | Archive schema version. Gates compatibility on load.
tool_version | string | The producing build, e.g. indx 0.4.2. Diagnostic only.
created_at | string (ISO-8601 UTC) | Build timestamp.
root | string | Absolute path of the walked directory/ZIP.
counts | object | documents, chunks, relations totals.
embedder | object | { name, dim }. Pins the model and vector dimensionality.
store | string | The vector store backend used (e.g. qdrant, jsonl).
checksum | object | algo plus a per-member digest map (e.g. index.json, embeddings/vectors.f32).
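As a rough sketch of "read it first", the manifest can be pulled out of the container on its own with Python's zipfile. The field names come from the table above; `read_manifest` and `summarize` are hypothetical helpers, not part of the indx API.

```python
import io
import json
import zipfile


def read_manifest(archive) -> dict:
    """Load only manifest.json; no other member is decompressed."""
    with zipfile.ZipFile(archive) as zf:
        return json.loads(zf.read("manifest.json"))


def summarize(manifest: dict) -> str:
    """Format the headline facts a consumer usually wants before loading."""
    counts = manifest["counts"]
    embedder = manifest["embedder"]
    return (
        f"indx {manifest['indx_version']}: "
        f"{counts['documents']} documents, {counts['chunks']} chunks, "
        f"{counts['relations']} relations; "
        f"{embedder['name']} ({embedder['dim']}-d), store={manifest['store']}"
    )


# Stand-in archive holding just the example manifest fields used above.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("manifest.json", json.dumps({
        "indx_version": "1.0",
        "counts": {"documents": 128, "chunks": 1042, "relations": 380},
        "embedder": {"name": "bge-m3", "dim": 1024},
        "store": "qdrant",
    }))

summary = summarize(read_manifest(buf))
print(summary)
```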

Sealing happens in stage 06 (Embed+Pack) via the .indx OutputWriter, or explicitly through space.save(path). The writer:

  1. Writes index.json — the serialized knowledge graph.
  2. Writes one per-chunk file under chunks/.
  3. Writes the vector matrix embeddings/vectors.f32 and its embeddings/manifest.json.
  4. Writes manifest.json, computing a SHA-256 checksum for each member.
  5. Deflates everything into the .indx ZIP container.
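The steps above can be approximated in a few lines of standard-library Python. This is a sketch under stated assumptions, not the OutputWriter itself: `seal` is a hypothetical function, the member bodies are placeholders, and zipfile compresses each member as it is written rather than in a separate final pass.

```python
import hashlib
import io
import json
import zipfile


def seal(members: dict[str, bytes], manifest: dict, out) -> dict:
    """Write members plus a checksummed manifest.json into a deflate ZIP."""
    checksum = {"algo": "sha256"}
    for name, data in members.items():
        checksum[name] = hashlib.sha256(data).hexdigest()
    sealed = {**manifest, "checksum": checksum}

    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in members.items():
            zf.writestr(name, data)
        # The manifest goes in last because it depends on every other member.
        zf.writestr("manifest.json", json.dumps(sealed, indent=2))
    return sealed


# Placeholder payloads; real members follow the schemas described earlier.
members = {
    "index.json": b"{}",
    "chunks/chunk_0000.json": b"{}",
    "embeddings/vectors.f32": b"\x00\x00\x80\x3f",
}
buf = io.BytesIO()
sealed = seal(members, {"indx_version": "1.0"}, buf)

with zipfile.ZipFile(buf) as zf:
    names = sorted(zf.namelist())
print(names)
```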

KnowledgeSpace.load(archive) reverses the process, with verification gates:

  1. Verify indx_version compatibility — see Versioning below. An incompatible archive fails fast.
  2. Validate checksums — each member is checked against the SHA-256 digest recorded in manifest.json; a mismatch is an archive error.
  3. Reconstruct the in-memory models — Pydantic v2 models are rebuilt from index.json and the chunk files (write and read share one schema, so the data validates on load).
  4. Memory-map vectors on demand: vectors.f32 is mmapped rather than read eagerly, so opening a large archive and reading metadata stays cheap; vectors are touched only when a search needs them.
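The checksum gate in step 2 might look roughly like the following. It is an illustrative sketch, assuming the checksum object maps member names to hex digests next to the algo key, as in the example manifest; `verify_checksums` is a hypothetical name, and the real loader raises an archive error rather than returning a list.

```python
import hashlib
import io
import json
import zipfile


def verify_checksums(archive) -> list[str]:
    """Return the names of members whose digest does not match the manifest."""
    bad = []
    with zipfile.ZipFile(archive) as zf:
        manifest = json.loads(zf.read("manifest.json"))
        digests = {k: v for k, v in manifest["checksum"].items() if k != "algo"}
        for name, expected in digests.items():
            h = hashlib.sha256()
            # Stream the member in 1 MiB blocks so large matrices stay off the heap.
            with zf.open(name) as member:
                for block in iter(lambda: member.read(1 << 20), b""):
                    h.update(block)
            if h.hexdigest() != expected:
                bad.append(name)
    return bad


# One member with a correct digest and one with a deliberately wrong digest.
payload = b"{}"
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("index.json", payload)
    zf.writestr("embeddings/vectors.f32", b"\x00" * 8)
    zf.writestr("manifest.json", json.dumps({
        "checksum": {
            "algo": "sha256",
            "index.json": hashlib.sha256(payload).hexdigest(),
            "embeddings/vectors.f32": "0" * 64,
        }
    }))

mismatched = verify_checksums(buf)
print(mismatched)
```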

.indx archives are versioned through two fields, and only one of them gates compatibility.

indx_version follows semantic versioning and controls whether an archive can be loaded:

Situation | Loader behavior
Same major version | Accepted.
Newer minor (same major) | Accepted. Forward-tolerant: unknown keys are ignored.
Major mismatch | Fatal load error with a clear message.

tool_version records the producing build (e.g. indx 0.4.2) purely for diagnostics and auditing. It never affects whether an archive loads.
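A major-gated check of this kind can be sketched as below. `SUPPORTED_MAJOR`, `IncompatibleArchiveError`, `check_indx_version`, and `is_loadable` are all hypothetical names chosen for illustration; the actual loader's exception type and message are not specified here.

```python
SUPPORTED_MAJOR = 1  # assumption: the major version this reader implements


class IncompatibleArchiveError(Exception):
    """Hypothetical error type standing in for the loader's fatal load error."""


def check_indx_version(indx_version: str, supported_major: int = SUPPORTED_MAJOR) -> None:
    """Raise if the archive's major version differs from the reader's."""
    major = int(indx_version.split(".")[0])
    if major != supported_major:
        raise IncompatibleArchiveError(
            f"archive indx_version {indx_version} cannot be loaded by a reader "
            f"for major version {supported_major}"
        )


def is_loadable(manifest: dict) -> bool:
    """Gate on indx_version only; tool_version is deliberately never read."""
    try:
        check_indx_version(manifest["indx_version"])
    except IncompatibleArchiveError:
        return False
    return True


results = {
    v: is_loadable({"indx_version": v, "tool_version": "indx 0.4.2"})
    for v in ("1.0", "1.7", "2.0")
}
print(results)
```

Only the leading component is compared, so a newer minor such as 1.7 passes and any unfamiliar keys it brings are left for the reader to ignore.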

This mirrors indx’s broader compatibility promise: within a major version, fields are only added, never removed or retyped, and consumers must ignore unknown values rather than fail. For an end-to-end view of what makes a rebuild byte-identical, see reproducibility.

Running a build writes the expanded layout alongside the sealed archive, so downstream tools can read either the portable container or the loose files directly — whichever is more convenient.

indx ./docs --out ./ai-ready
ai-ready/
├── handbook.indx # the portable archive (sealed)
├── index.json # the knowledge graph
├── chunks/ # agent-readable chunks + per-chunk context
└── embeddings/ # vectors + manifest

The loose index.json, chunks/, and embeddings/ mirror the members inside handbook.indx. The .indx file is what you ship or version-control as a single unit; the expanded form is handy for grepping, diffing, or wiring into a tool that wants plain files.
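Because the loose embeddings/vectors.f32 is a plain uncompressed file, a tool can memory-map it and decode a single row without reading the rest. The sketch below assumes the embeddings manifest uses the keys `dim` and `count`; `read_row` is a hypothetical helper, and the temporary directory stands in for ai-ready/embeddings/.

```python
import json
import mmap
import struct
import tempfile
from pathlib import Path


def read_row(embeddings_dir: Path, index: int) -> tuple[float, ...]:
    """Decode one vector from a memory-mapped vectors.f32."""
    meta = json.loads((embeddings_dir / "manifest.json").read_text())
    dim, count = meta["dim"], meta["count"]  # assumed key names
    if not 0 <= index < count:
        raise IndexError(f"row {index} out of range for {count} vectors")

    row = struct.Struct(f"<{dim}f")  # little-endian float32 row
    with open(embeddings_dir / "vectors.f32", "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the pages backing this row are actually touched.
            return row.unpack_from(mm, index * row.size)


# Create a small loose layout: three 2-dimensional vectors.
with tempfile.TemporaryDirectory() as tmp:
    emb = Path(tmp) / "embeddings"
    emb.mkdir()
    (emb / "manifest.json").write_text(json.dumps({"dim": 2, "count": 3}))
    (emb / "vectors.f32").write_bytes(struct.pack("<6f", 0.5, 1.5, 2.5, 3.5, 4.5, 5.5))
    third = read_row(emb, 2)

print(third)
```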

The container choice is deliberate and serves indx’s “open, no-lock-in artifact” goal:

  • Random access to individual members. ZIP lets a consumer read just manifest.json, or just one chunk, without decompressing the whole file — something a streamed tar.gz cannot do.
  • Stdlib and cross-platform. ZIP is handled by Python’s stdlib zipfile, so reading and writing archives adds no dependency to the light core.
  • Inspectable with ubiquitous tooling. Anyone can unzip handbook.indx and read the JSON by hand — important for an artifact meant to be a public contract.
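To make the random-access point concrete, here is a small sketch that reads one chunk member by name and leaves every other member untouched. `read_chunk` is a hypothetical helper, and the chunk bodies are placeholders rather than the canonical chunk shape.

```python
import io
import json
import zipfile


def read_chunk(archive, number: int) -> dict:
    """Read one chunk member by its zero-padded number, e.g. 1 -> chunk_0001."""
    name = f"chunks/chunk_{number:04d}.json"
    with zipfile.ZipFile(archive) as zf:
        # ZipFile locates the member through the central directory,
        # so no other member is decompressed.
        return json.loads(zf.read(name))


# Tiny in-memory archive with three placeholder chunks.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for n in range(3):
        zf.writestr(f"chunks/chunk_{n:04d}.json", json.dumps({"id": f"chunk_{n:04d}"}))

chunk = read_chunk(buf, 1)
print(chunk["id"])
```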

SQLite was considered (single-file and queryable) but is opaque to non-SQLite tooling and would couple the artifact to a query engine, so SQLite remains a store option rather than the archive format. The trade-off accepted for ZIP is slightly less compression efficiency for many tiny files than a solid tar.gz stream — a fair price for random access and tooling ubiquity.