06 · Embed + Pack
Embed+Pack is the final stage. It vectorizes every chunk through the Embedder slot, writes those vectors into the Store, materializes the accumulated SpaceContext into a KnowledgeSpace, and seals the portable .indx archive (plus an expanded on-disk layout) through the OutputWriter slot.
This is the only stage that touches three swappable component slots at once. It is also the only stage you can skip outright with --no-embed to produce a graph-only space.
What this stage does
By the time the context reaches stage 06, earlier stages have already produced the document graph, the chunk list, the resolved relations, and the enriched metadata. Embed+Pack performs four steps, in order:

- Embed: collect `chunk.text` for every chunk and pass it to `Embedder.embed(texts)`, populating `ctx.embeddings` (chunk_id → vector) and each `Chunk.embedding`.
- Upsert: write `(id, vector, payload)` records into the active store via `Store.upsert(...)`.
- Pack: `ctx.to_space()` materializes the `KnowledgeSpace` (documents, chunks, relations, metadata, embedding dimensionality).
- Seal: `Store.persist(...)` flushes vectors into the `embeddings/` layout, then `OutputWriter.write(space, out)` writes `index.json`, the per-chunk files, and the sealed `.indx` archive.
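The first two steps can be sketched with toy stand-ins. This is a minimal illustration, not the library's implementation: `ToyEmbedder`, `ToyStore`, the simplified `Chunk` dataclass, and `embed_and_upsert` are all hypothetical names, and the Pack and Seal steps are left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    """Simplified stand-in for a pipeline chunk (hypothetical)."""

    id: str
    text: str
    embedding: Optional[list] = None


class ToyEmbedder:
    """Hypothetical embedder: maps text to a 2-dimensional vector."""

    dim = 2

    def embed(self, texts):
        # Vector = (character count, space count); purely illustrative.
        return [[float(len(t)), float(t.count(" "))] for t in texts]


@dataclass
class ToyStore:
    """Hypothetical store: keeps (vector, payload) records in a dict."""

    records: dict = field(default_factory=dict)

    def upsert(self, ids, vectors, payloads):
        for id_, vec, payload in zip(ids, vectors, payloads):
            self.records[id_] = (vec, payload)


def embed_and_upsert(chunks, embedder, store):
    """Embed every chunk, attach the vectors, then upsert the records."""
    vectors = embedder.embed([c.text for c in chunks])
    embeddings = {}
    for chunk, vec in zip(chunks, vectors):
        chunk.embedding = vec
        embeddings[chunk.id] = vec  # chunk_id -> vector
    store.upsert(
        [c.id for c in chunks],
        vectors,
        [{"text": c.text} for c in chunks],
    )
    return embeddings


chunks = [Chunk("c0", "retention policy"), Chunk("c1", "data is kept 30 days")]
store = ToyStore()
embeddings = embed_and_upsert(chunks, ToyEmbedder(), store)
```

After the call, every chunk carries its vector and the store holds one record per chunk, which mirrors the one-vector-per-chunk behavior described for this stage.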
```
ctx.chunks ──► Embedder.embed(batch) ──► Store.upsert(batch) ──► ctx.to_space()
                                                                       │
                                               Store.persist() ◄───────┤
                                                                       ▼
                                                        OutputWriter.write(space, out) ──► handbook.indx + expanded layout
```

The three component protocols
Embed+Pack is defined entirely by these protocols. Each is a typed Protocol with a named default; see the full protocols reference.
Embedder — text to vectors
```python
@runtime_checkable
class Embedder(Protocol):
    """Turns text into vectors. Default: openai:text-embedding-3-small."""

    dim: int

    def embed(self, texts: list[str]) -> list[list[float]]: ...
```

The default is openai:text-embedding-3-small, a cloud embedder with 1536-dimensional vectors. The dim attribute is pinned into the archive manifest so a consumer can detect a dimension mismatch before querying. For local runs, use bge-m3 from the local profile. To weigh alternatives (BGE-M3, E5, OpenAI, Cohere), see choosing an embedder.
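As an illustration of how a custom class might satisfy this protocol, here is a minimal sketch. `HashEmbedder` and `check_dim` are hypothetical names introduced for this example, and the hash-based vectors carry no semantic meaning; the point is only the shape of the interface and the dimension check that a pinned `dim` makes possible.

```python
import hashlib
from typing import Protocol, runtime_checkable


@runtime_checkable
class Embedder(Protocol):
    dim: int

    def embed(self, texts: list[str]) -> list[list[float]]: ...


class HashEmbedder:
    """Hypothetical toy embedder: deterministic, not semantically meaningful."""

    def __init__(self, dim: int = 8) -> None:
        if not 1 <= dim <= 32:
            raise ValueError("dim must be between 1 and 32 (sha256 yields 32 bytes)")
        self.dim = dim

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Scale the first `dim` digest bytes into the range [0, 1].
            vectors.append([byte / 255 for byte in digest[: self.dim]])
        return vectors


def check_dim(embedder: Embedder, manifest_dim: int) -> None:
    """Raise if the query-time embedder disagrees with the archived dimension."""
    if embedder.dim != manifest_dim:
        raise ValueError(f"embedder dim {embedder.dim} != archive dim {manifest_dim}")


embedder = HashEmbedder(dim=8)
vectors = embedder.embed(["alpha", "beta"])
check_dim(embedder, 8)  # passes silently; a 1536-dim archive would raise
```

Because the protocol is `runtime_checkable`, `isinstance(embedder, Embedder)` holds for any object that exposes `dim` and `embed`, without inheriting from anything.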
Store — the vector database adapter
```python
@runtime_checkable
class Store(Protocol):
    """Vector database adapter. Default: qdrant. Also: pgvector, chroma, lancedb, jsonl."""

    def upsert(self, ids: list[str], vectors: list[list[float]], payloads: list[dict]) -> None: ...

    def query(self, vector: list[float], k: int = 5, filter: dict | None = None) -> list[tuple[str, float]]: ...

    def persist(self, dest: str) -> None:
        """Flush/export the store into the output `embeddings/` layout."""
```

The default is qdrant, which runs embedded (in-process / on-disk) or against a server with the same client code. Alternatives are pgvector, chroma, lancedb, and the zero-dependency jsonl store. upsert ingests vectors during the run; query powers space.search(...) and indx query; persist is what keeps archives portable (see below). See choosing a store.
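To make the three methods concrete, the following sketch implements them in memory. `MemoryStore` is a hypothetical class, not one of the shipped adapters. Cosine similarity as the score, equality matching on payload keys as the `filter` semantics, and the `vectors.jsonl` file name are all assumptions made for this example.

```python
import json
import math
import os
import tempfile


class MemoryStore:
    """Hypothetical in-memory store with brute-force cosine search."""

    def __init__(self) -> None:
        self._rows = {}  # id -> (vector, payload)

    def upsert(self, ids, vectors, payloads) -> None:
        for id_, vec, payload in zip(ids, vectors, payloads):
            self._rows[id_] = (vec, payload)  # an existing id is overwritten

    def query(self, vector, k=5, filter=None):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        hits = [
            (id_, cosine(vector, vec))
            for id_, (vec, payload) in self._rows.items()
            # Assumed filter semantics: every key must match the payload exactly.
            if not filter or all(payload.get(key) == val for key, val in filter.items())
        ]
        return sorted(hits, key=lambda hit: hit[1], reverse=True)[:k]

    def persist(self, dest: str) -> None:
        # Assumed layout: one JSON record per line in a single file.
        os.makedirs(dest, exist_ok=True)
        with open(os.path.join(dest, "vectors.jsonl"), "w", encoding="utf-8") as fh:
            for id_, (vec, payload) in self._rows.items():
                fh.write(json.dumps({"id": id_, "vector": vec, "payload": payload}) + "\n")


store = MemoryStore()
store.upsert(
    ["a", "b"],
    [[1.0, 0.0], [0.0, 1.0]],
    [{"lang": "en"}, {"lang": "de"}],
)
hits = store.query([0.9, 0.1], k=1)  # the vector closest to "a" ranks first
```

A brute-force scan like this is only reasonable for small spaces; the real adapters delegate search to the underlying database.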
OutputWriter — serializing the space to disk
```python
@runtime_checkable
class OutputWriter(Protocol):
    """Serializes a KnowledgeSpace to disk. Default: .indx. Also: jsonl, langchain, llamaindex."""

    format: str

    def write(self, space: KnowledgeSpace, out: str) -> None: ...
```

The default writer is .indx. Alternatives emit jsonl shards or adapters for langchain / llamaindex. See output formats.
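A writer only needs a `format` attribute and a `write(space, out)` method. The sketch below shows that shape with a hypothetical `JsonlWriter` and a stripped-down `ToySpace` standing in for `KnowledgeSpace`. The `chunks.jsonl` file name and the chunk dictionaries are assumptions for illustration and do not describe the shipped jsonl writer.

```python
import json
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class ToySpace:
    """Hypothetical stand-in for KnowledgeSpace; the real class holds more."""

    chunks: list = field(default_factory=list)


class JsonlWriter:
    """Hypothetical writer: one JSON object per chunk, one chunk per line."""

    format = "jsonl"

    def write(self, space: ToySpace, out: str) -> None:
        os.makedirs(out, exist_ok=True)
        with open(os.path.join(out, "chunks.jsonl"), "w", encoding="utf-8") as fh:
            for chunk in space.chunks:
                fh.write(json.dumps(chunk, ensure_ascii=False) + "\n")


space = ToySpace(chunks=[{"id": "c0", "text": "hello"}, {"id": "c1", "text": "world"}])
out_dir = tempfile.mkdtemp()
JsonlWriter().write(space, out_dir)
```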
Concurrency: batching beats parallelism
Unlike Parse (stage 02), which fans out across files in a worker pool, embedding is driven by batching. Local embedders are far more efficient on batches because of GPU/CPU vectorization, and store upserts are batched to amortize round-trips. Batching is the single biggest performance lever for stage 06, and it applies whether the embedder is local or an API.
| Param | Default | Notes |
|---|---|---|
| Embed batch size | 64 | Chunk texts are grouped into batches of 64 and submitted to Embedder.embed(list[str]). |
| Embed max concurrency | --jobs | Network-bound embedders use a bounded concurrency limit; local ones lean on batch size. |
| Store upsert | batched | Vectors are written in batches to amortize round-trips. |
Store upserts are batched in lockstep so vectors land as soon as each batch is embedded. Batch size is overridable via adapter sub-tables / kwargs.
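The lockstep pattern can be sketched as a single loop that embeds one batch and upserts it before moving on. The function name `embed_in_batches` and the callables passed to it are hypothetical; only the default batch size of 64 comes from the table above.

```python
def embed_in_batches(ids, texts, embed, upsert, batch_size=64):
    """Embed and upsert batch by batch; returns the number of batches sent."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = 0
    for start in range(0, len(texts), batch_size):
        batch_ids = ids[start : start + batch_size]
        batch_texts = texts[start : start + batch_size]
        vectors = embed(batch_texts)  # one call per batch, not per chunk
        upsert(batch_ids, vectors, [{"text": t} for t in batch_texts])
        batches += 1
    return batches


# Toy callables that stand in for Embedder.embed and Store.upsert.
upsert_sizes = []


def toy_embed(batch):
    return [[float(len(text))] for text in batch]


def toy_upsert(batch_ids, vectors, payloads):
    upsert_sizes.append(len(batch_ids))


texts = [f"chunk {i}" for i in range(1042)]
ids = [f"c{i}" for i in range(1042)]
batch_count = embed_in_batches(ids, texts, toy_embed, toy_upsert)
```

With 1042 chunks and a batch size of 64, the loop issues 17 embed calls and 17 upserts (16 full batches plus a final batch of 18) instead of 1042 of each.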
The on-disk output
Running indx ./docs --out ./ai-ready writes both the sealed archive and an expanded layout beside it, so downstream tools can read either form:

```
ai-ready/
├── handbook.indx    # the portable archive (a ZIP container)
├── index.json       # the knowledge graph
├── chunks/          # agent-readable chunks + per-chunk context
└── embeddings/      # vectors + manifest
```

Inside handbook.indx (a deflate ZIP), the layout is:

```
handbook.indx
├── manifest.json        # archive metadata + sha256 checksums
├── index.json           # the knowledge graph
├── chunks/
│   ├── chunk_0000.json
│   └── …
└── embeddings/
    ├── manifest.json    # model, dim, count, backend
    └── vectors.f32      # contiguous little-endian float32 matrix (count × dim)
```

Embeddings are not inlined in index.json; they live in embeddings/ as a contiguous little-endian float32 matrix and are memory-mapped on load. The embeddings manifest pins the embedder name and dim so a loaded archive can validate query-time compatibility. The full format is documented in the .indx archive reference.
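To show why a contiguous float32 matrix suits memory-mapping, the sketch below writes a small matrix and then decodes a single row without reading the rest of the file. It assumes a row-major layout with no header, which is one reading of the description above rather than a confirmed detail of the format. The helper names `write_matrix` and `read_row` are hypothetical.

```python
import mmap
import os
import struct
import tempfile


def write_matrix(path, rows):
    """Write rows back to back as little-endian float32 (assumed: no header)."""
    with open(path, "wb") as fh:
        for row in rows:
            fh.write(struct.pack(f"<{len(row)}f", *row))


def read_row(path, index, dim):
    """Memory-map the file and decode only row `index`."""
    row_bytes = dim * 4  # a float32 occupies 4 bytes
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            if index < 0 or (index + 1) * row_bytes > len(buf):
                raise IndexError(f"row {index} is out of range")
            return list(struct.unpack_from(f"<{dim}f", buf, index * row_bytes))


path = os.path.join(tempfile.mkdtemp(), "vectors.f32")
write_matrix(path, [[0.5, -1.0, 2.0], [0.25, 4.0, -8.0]])
row = read_row(path, index=1, dim=3)
```

The byte offset of any row is `index * dim * 4`, so the `dim` and `count` values recorded in the embeddings manifest are enough to locate every vector.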
Skipping the stage: --no-embed
Pass --no-embed (CLI) or drop("embed-pack") (SDK) to skip vectorization entirely and produce a graph-only space: documents, chunks, relations, and enrichment, but no vectors.

```
indx ./docs --out ./graph-only --no-embed
```

```python
from indx import DirectoryPipeline

pipeline = DirectoryPipeline().drop("embed-pack")
space = pipeline.run("./docs", "./graph-only")
```

A graph-only space cannot answer space.search(...) / indx query, but it is useful when you only need the directory graph, relations, and metadata, or when you plan to embed later with a different model.
A minimal run
```python
# requires the local profile: pip install "indx[bge,qdrant]"
from indx import DirectoryPipeline

pipeline = DirectoryPipeline(
    embedder="bge-m3",  # dim 1024
    store="qdrant",
    output=".indx",
)
space = pipeline.run("./docs", "./ai-ready")

print(space.stats.embeddings, space.stats.embed_dim)  # e.g. 1042 1024

for hit in space.search("how long is data retained?", k=3):
    print(hit.score, hit.source.path)
```

Embed+Pack produces exactly one vector per chunk, so for a fully embedded space space.stats.embeddings == space.stats.chunks. (A --no-embed run is the only case where they diverge: vectors drop to zero while chunks remain.)
The progress summary for this stage looks like:
```
06 embed   1042 vectors → qdrant, sealed handbook.indx
done: 1042 chunks, 128 docs, embed_dim=1024 (12.4s)
```

Where to go next
- The .indx archive: full container layout, manifest, sealing and loading.
- Choosing an embedder: bge-m3 vs. E5, OpenAI, Cohere.
- Choosing a store: Qdrant, pgvector, Chroma, LanceDB, JSONL.
- Output formats: `.indx`, JSONL, LangChain, LlamaIndex writers.
- Pipeline overview: how all six stages fit together.