06 · Embed + Pack

Embed+Pack is the final stage. It vectorizes every chunk through the Embedder slot, writes those vectors into the Store, materializes the accumulated SpaceContext into a KnowledgeSpace, and seals the portable .indx archive (plus an expanded on-disk layout) through the OutputWriter slot.

This is the only stage that touches three swappable component slots at once. It is also the only stage you can skip outright: pass --no-embed to produce a graph-only space.

What this stage does

By the time the context reaches stage 06, earlier stages have already produced the document graph, the chunk list, the resolved relations, and the enriched metadata. Embed+Pack performs four steps, in order:

1. Embed: collect chunk.text for every chunk and pass it to Embedder.embed(texts), populating ctx.embeddings (chunk_id → vector) and each Chunk.embedding.
2. Upsert: write (id, vector, payload) records into the active store via Store.upsert(...).
3. Pack: ctx.to_space() materializes the KnowledgeSpace (documents, chunks, relations, metadata, embedding dimensionality).
4. Seal: Store.persist(...) flushes vectors into the embeddings/ layout, then OutputWriter.write(space, out) writes index.json, the per-chunk files, and the sealed .indx archive.

ctx.chunks ──► Embedder.embed(batch) ──► Store.upsert(batch) ──► ctx.to_space()
                                                                      │
                                              Store.persist() ◄───────┤
                                                                      ▼
                                              OutputWriter.write(space, out) ──► handbook.indx + expanded layout
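The four steps can be sketched end to end with toy components. This is a minimal illustration, not the library's implementation: Chunk, ToyEmbedder, MemoryStore, and embed_and_pack are hypothetical stand-ins, the packed space is a plain dict rather than a real KnowledgeSpace, and the OutputWriter step is left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for the pipeline's chunk type."""
    id: str
    text: str
    embedding: list[float] | None = None


class ToyEmbedder:
    """Hypothetical embedder: maps text to (length, vowel count)."""
    dim = 2

    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t)), float(sum(c in "aeiou" for c in t))] for t in texts]


class MemoryStore:
    """Hypothetical store that keeps records in a dict."""

    def __init__(self) -> None:
        self.records: dict[str, tuple[list[float], dict]] = {}
        self.persisted_to: str | None = None

    def upsert(self, ids, vectors, payloads) -> None:
        for id_, vec, payload in zip(ids, vectors, payloads):
            self.records[id_] = (vec, payload)

    def persist(self, dest: str) -> None:
        self.persisted_to = dest  # a real store would write files here


def embed_and_pack(chunks, embedder, store, out):
    # 1. Embed: one vector per chunk, also attached to the chunk itself.
    vectors = embedder.embed([c.text for c in chunks])
    for chunk, vec in zip(chunks, vectors):
        chunk.embedding = vec
    # 2. Upsert: (id, vector, payload) records go into the store.
    store.upsert([c.id for c in chunks], vectors, [{"text": c.text} for c in chunks])
    # 3. Pack: a dict stands in for ctx.to_space().
    space = {"chunks": chunks, "embeddings": len(vectors), "embed_dim": embedder.dim}
    # 4. Seal: flush vectors; writing index.json and the archive is omitted.
    store.persist(f"{out}/embeddings")
    return space


chunks = [Chunk("c0", "retention policy"), Chunk("c1", "backup schedule")]
store = MemoryStore()
space = embed_and_pack(chunks, ToyEmbedder(), store, "./ai-ready")
print(space["embeddings"], space["embed_dim"])  # 2 2
```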

The three component protocols

Embed+Pack is defined entirely by these protocols. Each is a typed Protocol with a named default; see the full protocols reference.

Embedder — text to vectors

@runtime_checkable
class Embedder(Protocol):
    """Turns text into vectors. Default: openai:text-embedding-3-small."""
    dim: int
    def embed(self, texts: list[str]) -> list[list[float]]: ...

The default is openai:text-embedding-3-small, a cloud embedder with 1536-dimensional vectors. The dim attribute is pinned into the archive manifest so a consumer can detect a dimension mismatch before querying. For local runs, use bge-m3 from the local profile. To weigh alternatives (BGE-M3, E5, OpenAI, Cohere), see choosing an embedder.
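To make the protocol concrete, here is a small sketch of a class that satisfies it, plus the kind of dimension check a consumer might run against a manifest. HashEmbedder and check_dim are hypothetical names introduced for illustration, and the manifest is assumed to expose its dimensionality under a "dim" key. The hash-based vectors are deterministic but carry no semantic meaning.

```python
import hashlib
from typing import Protocol, runtime_checkable


@runtime_checkable
class Embedder(Protocol):
    dim: int
    def embed(self, texts: list[str]) -> list[list[float]]: ...


class HashEmbedder:
    """Hypothetical toy embedder built on SHA-256; for wiring tests only."""

    def __init__(self, dim: int = 8) -> None:
        if not 1 <= dim <= 32:  # a SHA-256 digest has 32 bytes
            raise ValueError("dim must be between 1 and 32")
        self.dim = dim

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Scale the first `dim` bytes into floats in [0, 1].
            vectors.append([digest[i] / 255.0 for i in range(self.dim)])
        return vectors


def check_dim(manifest: dict, embedder: Embedder) -> None:
    """Raise before querying if the archive and the embedder disagree on dim."""
    if manifest["dim"] != embedder.dim:
        raise ValueError(
            f"archive dim {manifest['dim']} != embedder dim {embedder.dim}"
        )


embedder = HashEmbedder(dim=8)
print(isinstance(embedder, Embedder), len(embedder.embed(["hello"])[0]))  # True 8
```

Because the protocol is runtime_checkable, isinstance only verifies that dim and embed exist; it does not validate signatures or vector lengths.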

Store — the vector database adapter

@runtime_checkable
class Store(Protocol):
    """Vector database adapter. Default: qdrant.
    Also: pgvector, chroma, lancedb, jsonl."""
    def upsert(self, ids: list[str], vectors: list[list[float]],
               payloads: list[dict]) -> None: ...
    def query(self, vector: list[float], k: int = 5,
              filter: dict | None = None) -> list[tuple[str, float]]: ...
    def persist(self, dest: str) -> None:
        """Flush/export the store into the output `embeddings/` layout."""

The default is qdrant, which runs embedded (in-process / on-disk) or against a server with the same client code. Alternatives are pgvector, chroma, lancedb, and the zero-dependency jsonl store. upsert ingests vectors during the run; query powers space.search(...) and indx query; persist is what keeps archives portable (see below). See choosing a store.
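As a rough illustration of the three methods, the sketch below implements them over a plain dict. It is not one of the shipped adapters: MemoryStore is a hypothetical name, the filter is assumed to mean exact match on payload keys, scores are assumed to be cosine similarity, and persist writes a vectors.jsonl file chosen for readability rather than the vectors.f32 layout.

```python
import json
import math
import os


class MemoryStore:
    """Hypothetical in-memory store with the upsert/query/persist shape."""

    def __init__(self) -> None:
        self._rows: dict[str, tuple[list[float], dict]] = {}

    def upsert(self, ids, vectors, payloads) -> None:
        # Re-upserting an existing id overwrites its record.
        for id_, vec, payload in zip(ids, vectors, payloads):
            self._rows[id_] = (vec, payload)

    def query(self, vector, k=5, filter=None):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        hits = [
            (id_, cosine(vector, vec))
            for id_, (vec, payload) in self._rows.items()
            # Assumed filter semantics: every key must match exactly.
            if filter is None or all(payload.get(key) == val for key, val in filter.items())
        ]
        hits.sort(key=lambda hit: hit[1], reverse=True)
        return hits[:k]

    def persist(self, dest: str) -> None:
        os.makedirs(dest, exist_ok=True)
        # Hypothetical file name; one JSON record per line.
        with open(os.path.join(dest, "vectors.jsonl"), "w", encoding="utf-8") as fh:
            for id_, (vec, payload) in self._rows.items():
                fh.write(json.dumps({"id": id_, "vector": vec, "payload": payload}) + "\n")
```

A linear scan like this is only practical for small spaces; real adapters delegate the search to the database's index.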

OutputWriter — serializing the space to disk

@runtime_checkable
class OutputWriter(Protocol):
    """Serializes a KnowledgeSpace to disk. Default: .indx.
    Also: jsonl, langchain, llamaindex."""
    format: str
    def write(self, space: KnowledgeSpace, out: str) -> None: ...

The default writer is .indx. Alternatives emit jsonl shards or adapters for langchain / llamaindex. See output formats.
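A custom writer only needs a format attribute and a write method. The sketch below shows the shape under stated assumptions: KnowledgeSpace here is a hypothetical stand-in holding chunks as plain dicts, and JsonlWriter with its chunks.jsonl file name is illustrative, not the shipped jsonl writer.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class KnowledgeSpace:
    """Hypothetical stand-in; the real class also carries documents and relations."""
    chunks: list[dict] = field(default_factory=list)
    embed_dim: int = 0


class JsonlWriter:
    """Hypothetical writer that emits one JSON object per chunk."""
    format = "jsonl"

    def write(self, space: KnowledgeSpace, out: str) -> None:
        out_dir = Path(out)
        out_dir.mkdir(parents=True, exist_ok=True)
        with open(out_dir / "chunks.jsonl", "w", encoding="utf-8") as fh:
            for chunk in space.chunks:
                fh.write(json.dumps(chunk, ensure_ascii=False) + "\n")
```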

Concurrency: batching beats parallelism

Unlike Parse (stage 02), which fans out across files in a worker pool, embedding is driven by batching. Local embedders are far more efficient on batches because of GPU/CPU vectorization, and store upserts are batched to amortize round-trips. Batching is the single biggest performance lever for stage 06, whether the embedder runs locally or behind an API.

| Param | Default | Notes |
| --- | --- | --- |
| Embed batch size | `64` | Chunk texts are grouped into batches of 64 and submitted to `Embedder.embed(list[str])`. |
| Embed max concurrency | `--jobs` | Network-bound embedders use a bounded concurrency limit; local ones lean on batch size. |
| Store upsert | batched | Vectors are written in batches to amortize round-trips. |

Store upserts run in lockstep with embedding: each batch is written to the store as soon as it has been embedded. Batch size can be overridden through adapter sub-tables or kwargs.
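The lockstep pattern can be sketched as a simple loop: slice the chunk list into batches, embed one batch, upsert it, then move on. The helper names (batched, embed_in_batches) and the two stub classes are hypothetical, and the bounded concurrency used for network embedders is not shown.

```python
from typing import Iterator


def batched(items: list, size: int = 64) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


class RecordingEmbedder:
    """Hypothetical stub that records the size of every batch it receives."""
    dim = 1

    def __init__(self) -> None:
        self.batch_sizes: list[int] = []

    def embed(self, texts: list[str]) -> list[list[float]]:
        self.batch_sizes.append(len(texts))
        return [[float(len(t))] for t in texts]


class ListStore:
    """Hypothetical stub that only remembers which ids were written."""

    def __init__(self) -> None:
        self.ids: list[str] = []

    def upsert(self, ids, vectors, payloads) -> None:
        self.ids.extend(ids)


def embed_in_batches(ids, texts, embedder, store, batch_size=64) -> int:
    """Embed and upsert batch by batch; return the number of vectors written."""
    written = 0
    for id_batch, text_batch in zip(batched(ids, batch_size), batched(texts, batch_size)):
        vectors = embedder.embed(text_batch)
        # Upsert immediately so vectors land as soon as the batch is embedded.
        store.upsert(id_batch, vectors, [{"text": t} for t in text_batch])
        written += len(vectors)
    return written


ids = [f"chunk_{i:04d}" for i in range(150)]
texts = [f"text {i}" for i in range(150)]
embedder, store = RecordingEmbedder(), ListStore()
print(embed_in_batches(ids, texts, embedder, store), embedder.batch_sizes)
# 150 [64, 64, 22]
```

With 150 chunks and the default batch size, the embedder is called three times instead of 150, which is where the savings come from.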

The on-disk output

Running indx ./docs --out ./ai-ready writes both the sealed archive and an expanded layout beside it, so downstream tools can read either form:

ai-ready/
├── handbook.indx        # the portable archive (a ZIP container)
├── index.json           # the knowledge graph
├── chunks/              # agent-readable chunks + per-chunk context
└── embeddings/          # vectors + manifest

Inside handbook.indx (a deflate ZIP), the layout is:

handbook.indx
├── manifest.json            # archive metadata + sha256 checksums
├── index.json               # the knowledge graph
├── chunks/
│   ├── chunk_0000.json
│   └── …
└── embeddings/
    ├── manifest.json         # model, dim, count, backend
    └── vectors.f32           # contiguous little-endian float32 matrix (count × dim)
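Since the archive is a ZIP container, it can be inspected with standard tooling. The following sketch uses Python's zipfile module to list members and read the embeddings manifest; the function names are hypothetical, and nothing is assumed about the manifest beyond it being JSON at the path shown in the layout.

```python
import json
import zipfile


def list_members(archive_path: str) -> list[str]:
    """Return the sorted member paths stored in the archive."""
    with zipfile.ZipFile(archive_path) as zf:
        return sorted(zf.namelist())


def read_embedding_manifest(archive_path: str) -> dict:
    """Parse embeddings/manifest.json without extracting the archive."""
    with zipfile.ZipFile(archive_path) as zf:
        with zf.open("embeddings/manifest.json") as fh:
            return json.load(fh)
```

For example, read_embedding_manifest("ai-ready/handbook.indx") would return the parsed manifest as a dict, assuming an archive exists at that path.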

Embeddings are not inlined in index.json — they live in embeddings/ as a contiguous little-endian float32 matrix and are memory-mapped on load. The embeddings manifest pins the embedder name and dim so a loaded archive can validate query-time compatibility. The full format is documented in the .indx archive reference.
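To show what a contiguous little-endian float32 matrix means in practice, here is a sketch that memory-maps vectors.f32 from the expanded layout and decodes a single row using only the standard library. It assumes the manifest stores its values under the keys "dim" and "count"; load_vector is a hypothetical helper, and how row order maps to chunk ids is not covered here.

```python
import json
import mmap
import struct
from pathlib import Path

FLOAT32_BYTES = 4


def load_vector(emb_dir: str, row: int) -> list[float]:
    """Decode one row of vectors.f32 without reading the whole file."""
    emb = Path(emb_dir)
    manifest = json.loads((emb / "manifest.json").read_text(encoding="utf-8"))
    dim, count = manifest["dim"], manifest["count"]  # assumed key names
    if not 0 <= row < count:
        raise IndexError(f"row {row} out of range for {count} vectors")
    with open(emb / "vectors.f32", "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            # The file should hold exactly count * dim float32 values.
            if len(buf) != count * dim * FLOAT32_BYTES:
                raise ValueError("vectors.f32 size does not match the manifest")
            # Row i starts at byte offset i * dim * 4; "<" means little-endian.
            offset = row * dim * FLOAT32_BYTES
            return list(struct.unpack_from(f"<{dim}f", buf, offset))
```

The size check is a cheap way to catch a manifest whose dim or count does not match the file before any vector is decoded.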

Skipping the stage: `--no-embed`

Pass --no-embed (CLI) or drop("embed-pack") (SDK) to skip vectorization entirely and produce a graph-only space — documents, chunks, relations, and enrichment, but no vectors.

indx ./docs --out ./graph-only --no-embed

pipeline = DirectoryPipeline().drop("embed-pack")
space = pipeline.run("./docs", "./graph-only")

A graph-only space cannot answer space.search(...) / indx query, but it is useful when you only need the directory graph, relations, and metadata — or when you plan to embed later with a different model.

A minimal run

# requires the local profile: pip install "indx[bge,qdrant]"
from indx import DirectoryPipeline

pipeline = DirectoryPipeline(
    embedder="bge-m3",   # dim 1024
    store="qdrant",
    output=".indx",
)
space = pipeline.run("./docs", "./ai-ready")

print(space.stats.embeddings, space.stats.embed_dim)   # e.g. 1042 1024

for hit in space.search("how long is data retained?", k=3):
    print(hit.score, hit.source.path)

Embed+Pack produces exactly one vector per chunk, so a fully embedded space satisfies space.stats.embeddings == space.stats.chunks. A --no-embed run is the only case where the two diverge: the vector count drops to zero while the chunks remain.

The progress summary for this stage looks like:

  06 embed   1042 vectors → qdrant, sealed handbook.indx
done: 1042 chunks, 128 docs, embed_dim=1024  (12.4s)

Where to go next

The .indx archive — full container layout, manifest, sealing and loading.
Choosing an embedder — bge-m3 vs. E5, OpenAI, Cohere.
Choosing a store — Qdrant, pgvector, Chroma, LanceDB, JSONL.
Output formats — .indx, JSONL, LangChain, LlamaIndex writers.
Pipeline overview — how all six stages fit together.