Choosing an Embedder
The embedder turns chunk text into vectors during stage 06, Embed+Pack. It is the component that makes a knowledge space searchable, and its identity is recorded into the archive so consumers always know exactly which model produced the vectors. This guide helps you pick the right one and explains the consequences of changing it.
- Default: `openai:text-embedding-3-small` (cloud-backed, light to install, dim 1536). Use `bge-m3` when you need a fully local profile.
- Lighter local English: `e5`.
- No local GPU / already paying for an API: `openai` or `cohere`.
- Local embedders are the heaviest optional path (they pull Torch); API embedders stay light.
- The model identity (`name`) and `dim` are pinned into the archive manifest. Changing the embedder requires a full re-embed.
The Embedder protocol
Every embedder, built-in or third-party, satisfies the same typed protocol, so the pipeline never needs to know which one is active:
```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Embedder(Protocol):
    """Turns text into vectors. Default: openai:text-embedding-3-small."""

    dim: int

    def embed(self, texts: list[str]) -> list[list[float]]: ...
```

Two things matter for selection:
- `embed(texts)` takes a list of strings and returns one vector (`list[float]`) per input. indx always calls it in batches (see Batching).
- `dim` is the vector dimensionality. It is read once and pinned into the archive so query-time compatibility can be validated.
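As a rough illustration of the protocol described here, the sketch below defines a toy embedder and checks it structurally. `HashEmbedder` and its hashing scheme are hypothetical and produce vectors with no semantic meaning; only the `dim` attribute and the `embed` signature are taken from this guide, and the protocol is re-declared locally so the example runs without indx installed.

```python
import hashlib
from typing import Protocol, runtime_checkable


@runtime_checkable
class Embedder(Protocol):
    """Local copy of the protocol from this guide, so the sketch is self-contained."""

    dim: int

    def embed(self, texts: list[str]) -> list[list[float]]: ...


class HashEmbedder:
    """Hypothetical toy embedder: deterministic, but not semantically meaningful."""

    dim = 8

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Map the first `dim` bytes of the digest to floats in [0, 1].
            vectors.append([byte / 255 for byte in digest[: self.dim]])
        return vectors


embedder = HashEmbedder()
vectors = embedder.embed(["hello", "world"])

# Structural check: the object exposes both `dim` and `embed`.
assert isinstance(embedder, Embedder)
# One vector per input, each of length `dim`.
assert len(vectors) == 2 and all(len(v) == embedder.dim for v in vectors)
```

Note that `isinstance` against a runtime-checkable protocol only confirms that the members exist; it does not verify that `embed` actually returns vectors of length `dim`.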
The options
| Embedder | Runs | Strengths | Best for | Extra |
|---|---|---|---|---|
| `openai:text-embedding-3-small` (default) | API | No local GPU, no model download, dim 1536 | Cloud-backed default and lightweight installs | `indx[openai]` |
| `bge-m3` | Local | Multilingual, long inputs, strong open-license retrieval, dim 1024 | Local / air-gapped profile; mixed-language and document-heavy corpora | `indx[bge]` / `indx[embeddings-local]` (pulls Torch) |
| `e5` | Local | Lighter than BGE-M3, strong English retrieval | English-only corpora where you want a smaller local footprint | `indx[e5]` / `indx[embeddings-local]` (pulls Torch) |
| `openai` | API | No local GPU, no model download, managed quality | Teams already on OpenAI, or machines without a GPU | `indx[openai]` (light, HTTP only) |
| `cohere` | API | No local GPU, strong multilingual API models | Teams already on Cohere | `indx[cohere]` (light, HTTP only) |
Why BGE-M3 remains the local default
BGE-M3 preserves indx’s local-first principle while being a genuinely strong general embedder:
- Fully local — no API key, works air-gapped out of the box (see Local & air-gapped).
- Multilingual and supports long inputs, which suits arbitrary directory contents (code, docs, mixed languages).
- Strong retrieval quality among openly licensed models, with native dense embeddings (and hybrid/multi-vector modes available) — a good default without per-corpus tuning.
- Dim 1024, which is what `space.stats.embed_dim` and the manifest report on a default local build (`bge-m3`).
When to switch
- Choose `e5` if your corpus is English-only and you want a lighter local model than BGE-M3.
- Choose `openai` or `cohere` if the machine running the build has no GPU (or you don’t want to download model weights), or your team already pays for those APIs. API embedders avoid the Torch dependency entirely.
Local vs. API: the dependency story
This is the single biggest practical difference between the options.
- Local embedders (`bge-m3`, `e5`) load through FlagEmbedding (BGE’s reference implementation) or sentence-transformers. These are installed via `indx[bge]` / `indx[e5]` (or the umbrella `indx[embeddings-local]`), and they pull in Torch + model weights. This is the heaviest optional path in the whole project.
- API embedders (`openai`, `cohere`) just make HTTP calls. Their extras (`indx[openai]`, `indx[cohere]`) stay light: no Torch, no weights.
```bash
# Recommended local default stack (docling + local embeddings + qdrant):
pip install "indx[local]"

# Just the local embedder runtime (Torch comes with it):
pip install "indx[embeddings-local]"

# Light API embedders (no Torch):
pip install "indx[openai]"
pip install "indx[cohere]"
```

If you select an embedder whose extra is not installed, indx raises a clear `MissingDependencyError` naming the exact `pip install "indx[...]"` to run. See the full extras matrix.
Selecting an embedder
The embedder slot is resolved with the standard precedence: explicit code argument / `use()` → CLI flag → `indx.toml` → documented default.
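The precedence rule can be pictured as "first value that is set wins". The sketch below is only an illustration of that ordering, not indx's actual resolver: `resolve_embedder` and its parameter names are hypothetical, and the default name is the one documented in this guide.

```python
from typing import Optional

# Documented default from this guide.
DOCUMENTED_DEFAULT = "openai:text-embedding-3-small"


def resolve_embedder(
    code_arg: Optional[str] = None,
    cli_flag: Optional[str] = None,
    toml_value: Optional[str] = None,
) -> str:
    """Return the first value that is set, in precedence order (hypothetical helper)."""
    for candidate in (code_arg, cli_flag, toml_value):
        if candidate is not None:
            return candidate
    return DOCUMENTED_DEFAULT


print(resolve_embedder(cli_flag="e5", toml_value="bge-m3"))  # e5
print(resolve_embedder(toml_value="bge-m3"))                 # bge-m3
print(resolve_embedder())                                    # openai:text-embedding-3-small
```

In this sketch a CLI flag overrides the value in `indx.toml`, and an explicit code argument would override both.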
```bash
indx ./docs --out ./ai-ready --embedder e5
```

indx.toml

```toml
[embed]
model = "openai:text-embedding-3-small"  # any registered embedder name; use "bge-m3" for local
```

SDK: by name or by object
```python
from indx import DirectoryPipeline

# By name string
pipeline = DirectoryPipeline(embedder="bge-m3", store="qdrant")

# Or swap later
pipeline.use(embedder="openai")


# Or pass a custom object satisfying the Embedder protocol
class MyEmbedder:
    dim = 768

    def embed(self, texts: list[str]) -> list[list[float]]: ...


pipeline.use(embedder=MyEmbedder())
```

For authoring your own embedder backend, see Custom components and Adding a backend.
Batching
Embedding is batched, which is the single biggest performance lever for this stage. Chunk texts are grouped into batches (default 64) and submitted to `Embedder.embed(list[str])`; the resulting vectors are then written to the store with batched upsert calls.
| Param | Default |
|---|---|
| Embed batch size | 64 |
| Embed max concurrency | --jobs |
Local models are far more efficient on batches (CPU/GPU vectorization); API embedders amortize round-trips the same way and use a bounded concurrency limit to respect rate limits. Tune batch size via the embedder’s adapter sub-table or kwargs — see Performance.
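A minimal sketch of the batching idea, assuming a plain sequential loop: the helpers `batched` and `embed_all` and the `CountingEmbedder` stand-in are hypothetical names, and the bounded concurrency that API embedders use is left out for brevity. Only the default batch size of 64 and the `embed(list[str])` call shape come from this guide.

```python
from typing import Iterator


def batched(texts: list[str], batch_size: int = 64) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `batch_size` texts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]


def embed_all(embedder, texts: list[str], batch_size: int = 64) -> list[list[float]]:
    """Embed every text, making one `embedder.embed` call per batch."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embedder.embed(batch))
    return vectors


class CountingEmbedder:
    """Hypothetical stand-in that records how many batches it receives."""

    dim = 2

    def __init__(self) -> None:
        self.calls = 0

    def embed(self, texts: list[str]) -> list[list[float]]:
        self.calls += 1
        return [[float(len(text)), 0.0] for text in texts]


embedder = CountingEmbedder()
vectors = embed_all(embedder, [f"chunk {i}" for i in range(150)])
print(len(vectors), embedder.calls)  # 150 3  (batches of 64, 64, and 22)
```

With an API embedder, each call in this loop corresponds to one round-trip, which is why larger batches reduce overhead.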
Model identity is pinned into the archive
When stage 06 seals the `.indx` archive, the embedder’s identity is written into two places: `manifest.json` at the archive root:

```json
{
  "embedder": { "name": "bge-m3", "dim": 1024 },
  "store": "qdrant"
}
```

and a dedicated `embeddings/manifest.json` alongside the raw vector matrix:

```text
handbook.indx
├── manifest.json        # embedder name + dim, checksums, counts
├── index.json
├── chunks/
└── embeddings/
    ├── manifest.json    # model, dim, count, backend
    └── vectors.f32      # contiguous little-endian float32 matrix (count × dim)
```

This makes the archive self-describing: a consumer knows exactly which model produced the vectors. Vectors are stored as a contiguous little-endian float32 matrix of shape count × dim.
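To make the `vectors.f32` layout concrete, here is a small round-trip sketch using only the standard library. It assumes the matrix is stored row by row and that `embeddings/manifest.json` exposes `count` and `dim` keys (names taken from the layout comment, not from a confirmed schema). `write_vectors` and `read_vectors` are hypothetical helpers, not part of indx.

```python
import json
import struct
import tempfile
from pathlib import Path


def write_vectors(path: Path, vectors: list[list[float]]) -> None:
    """Write rows back to back as little-endian float32 values."""
    with path.open("wb") as fh:
        for row in vectors:
            fh.write(struct.pack(f"<{len(row)}f", *row))


def read_vectors(path: Path, count: int, dim: int) -> list[list[float]]:
    """Read a count × dim little-endian float32 matrix, checking the file size first."""
    data = path.read_bytes()
    expected = count * dim * 4  # 4 bytes per float32
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, found {len(data)}")
    flat = struct.unpack(f"<{count * dim}f", data)
    return [list(flat[i * dim : (i + 1) * dim]) for i in range(count)]


with tempfile.TemporaryDirectory() as tmp:
    emb_dir = Path(tmp) / "embeddings"
    emb_dir.mkdir()
    rows = [[0.5, -1.0, 2.0], [0.25, 0.0, 8.0]]  # exactly representable in float32
    write_vectors(emb_dir / "vectors.f32", rows)
    # Assumed manifest keys, based on the layout comment (model, dim, count, backend).
    (emb_dir / "manifest.json").write_text(
        json.dumps({"model": "toy", "dim": 3, "count": 2, "backend": "example"})
    )
    manifest = json.loads((emb_dir / "manifest.json").read_text())
    loaded = read_vectors(emb_dir / "vectors.f32", manifest["count"], manifest["dim"])

assert loaded == rows
```

The size check is the useful part: if `count × dim × 4` does not match the file length, the manifest and the matrix disagree and the vectors should not be trusted.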
Changing the embedder means re-embedding
Vectors from one model are not comparable with vectors from another, so changing the embedder requires a full re-embed. This is reflected in the cache and resume behavior:
- With `--resume`, changing the embedder invalidates only the Embed stage: Walk, Parse, Chunk, Relate, and Enrich outputs are reused from cache. Changing the parser, by contrast, invalidates Parse and everything downstream.
- The resolved config snapshot (including the embedder name) is recorded in `index.json.metadata` and the manifest for auditability.
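The invalidation rule amounts to "rerun the first affected stage and everything after it". The sketch below illustrates that rule only. The stage order is assumed from the order in which the stages are named in this guide, and `FIRST_AFFECTED_STAGE`, `stages_to_rerun`, and `stages_reused` are hypothetical names rather than indx internals.

```python
# Stage order as named in this guide (assumed); Embed+Pack is stage 06.
STAGES = ["Walk", "Parse", "Chunk", "Relate", "Enrich", "Embed"]

# Hypothetical mapping from a configuration slot to the first stage it affects.
FIRST_AFFECTED_STAGE = {"parser": "Parse", "embedder": "Embed"}


def stages_to_rerun(changed_slot: str) -> list[str]:
    """Return the first affected stage and everything downstream of it."""
    first = STAGES.index(FIRST_AFFECTED_STAGE[changed_slot])
    return STAGES[first:]


def stages_reused(changed_slot: str) -> list[str]:
    """Return the upstream stages whose cached outputs can be reused."""
    first = STAGES.index(FIRST_AFFECTED_STAGE[changed_slot])
    return STAGES[:first]


print(stages_to_rerun("embedder"))  # ['Embed']
print(stages_reused("embedder"))    # ['Walk', 'Parse', 'Chunk', 'Relate', 'Enrich']
print(stages_to_rerun("parser"))    # ['Parse', 'Chunk', 'Relate', 'Enrich', 'Embed']
```

Because the embedder sits at the end of the pipeline, swapping it is the cheapest component change in terms of recomputed stages, even though every vector must be regenerated.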
```bash
# Switch embedders; everything upstream is reused, only vectors are recomputed.
indx ./docs --out ./ai-ready --embedder e5 --resume
```

Related pages

- Embed+Pack stage: what stage 06 does end to end.
- The `.indx` archive: full archive layout and manifest fields.
- Extras reference: every `pip install indx[...]` option.
- Choosing a store: where the vectors land.
- Bring your own stack: how slots and protocols fit together.