Choosing an Embedder

The embedder turns chunk text into vectors during stage 06, Embed+Pack. It is the component that makes a knowledge space searchable, and its identity is recorded into the archive so consumers always know exactly which model produced the vectors. This guide helps you pick the right one and explains the consequences of changing it.

  • Default: openai:text-embedding-3-small — cloud-backed, light to install, dim 1536. Use bge-m3 when you need a fully local profile.
  • Lighter local English: e5.
  • No local GPU / already paying for an API: openai or cohere.
  • Local embedders are the heaviest optional path (they pull Torch); API embedders stay light.
  • The model identity (name) and dim are pinned into the archive manifest. Changing the embedder requires a full re-embed.

Every embedder — built-in or third-party — satisfies the same typed protocol, so the pipeline never needs to know which one is active:

from typing import Protocol, runtime_checkable

@runtime_checkable
class Embedder(Protocol):
    """Turns text into vectors. Default: openai:text-embedding-3-small."""
    dim: int
    def embed(self, texts: list[str]) -> list[list[float]]: ...

Two things matter for selection:

  • embed(texts) takes a list of strings and returns one vector (list[float]) per input. indx always calls it in batches (see Batching).
  • dim is the vector dimensionality. It is read once and pinned into the archive so query-time compatibility can be validated.
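
As an illustration of the protocol, here is a minimal sketch of a third-party embedder. The HashEmbedder class and its hashing scheme are hypothetical and useless for real retrieval; the Embedder protocol is re-declared locally only so the snippet runs without indx installed.

```python
import hashlib
from typing import Protocol, runtime_checkable


@runtime_checkable
class Embedder(Protocol):
    # Local copy of the protocol shown above, so this sketch is self-contained.
    dim: int

    def embed(self, texts: list[str]) -> list[list[float]]: ...


class HashEmbedder:
    """Hypothetical toy embedder: deterministic pseudo-vectors from SHA-256."""

    dim = 8

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Map the first `dim` bytes of the digest into [0.0, 1.0].
            vectors.append([byte / 255.0 for byte in digest[: self.dim]])
        return vectors


embedder = HashEmbedder()
vectors = embedder.embed(["first chunk", "second chunk"])
# One vector per input, each of length `dim`.
print(len(vectors), len(vectors[0]), isinstance(embedder, Embedder))
```

Because the protocol is structural, HashEmbedder needs no base class: a dim attribute and an embed method are enough for the isinstance check to pass.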

| Embedder | Runs | Strengths | Best for | Extra |
| --- | --- | --- | --- | --- |
| openai:text-embedding-3-small (default) | API | No local GPU, no model download, dim 1536 | Cloud-backed default and lightweight installs | indx[openai] |
| bge-m3 | Local | Multilingual, long inputs, strong open-license retrieval, dim 1024 | Local / air-gapped profile; mixed-language and document-heavy corpora | indx[bge] / indx[embeddings-local] (pulls Torch) |
| e5 | Local | Lighter than BGE-M3, strong English retrieval | English-only corpora where you want a smaller local footprint | indx[e5] / indx[embeddings-local] (pulls Torch) |
| openai | API | No local GPU, no model download, managed quality | Teams already on OpenAI, or machines without a GPU | indx[openai] (light, HTTP only) |
| cohere | API | No local GPU, strong multilingual API models | Teams already on Cohere | indx[cohere] (light, HTTP only) |

BGE-M3 preserves indx’s local-first principle while being a genuinely strong general embedder:

  • Fully local — no API key, works air-gapped out of the box (see Local & air-gapped).
  • Multilingual and supports long inputs, which suits arbitrary directory contents (code, docs, mixed languages).
  • Strong retrieval quality among openly licensed models, with native dense embeddings (hybrid and multi-vector modes are also available), which makes it a good local choice without per-corpus tuning.
  • Dim 1024, which is what space.stats.embed_dim and the manifest report on a bge-m3 build.
  • Choose e5 if your corpus is English-only and you want a lighter local model than BGE-M3.
  • Choose openai or cohere if the machine running the build has no GPU (or you don’t want to download model weights), or your team already pays for those APIs. API embedders avoid the Torch dependency entirely.

Install weight is the single biggest practical difference between the options.

  • Local embedders (bge-m3, e5) load through FlagEmbedding (BGE’s reference implementation) or sentence-transformers. These are installed via indx[bge] / indx[e5] (or the umbrella indx[embeddings-local]), and they pull in Torch + model weights. This is the heaviest optional path in the whole project.
  • API embedders (openai, cohere) just make HTTP calls. Their extras (indx[openai], indx[cohere]) stay light — no Torch, no weights.
# Recommended local default stack (docling + local embeddings + qdrant):
pip install "indx[local]"
# Just the local embedder runtime (Torch comes with it):
pip install "indx[embeddings-local]"
# Light API embedders — no Torch:
pip install "indx[openai]"
pip install "indx[cohere]"

If you select an embedder whose extra is not installed, indx raises a clear MissingDependencyError naming the exact pip install "indx[...]" to run. See the full extras matrix.

The embedder slot is resolved with the standard precedence: explicit code argument / use() → CLI flag → indx.toml → documented default.
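
The precedence rule can be sketched as a first-non-empty lookup. This is only an illustration of the ordering described above: resolve_embedder and its parameter names are hypothetical, not part of indx's API.

```python
from typing import Optional

# The documented default named in this guide.
DEFAULT_EMBEDDER = "openai:text-embedding-3-small"


def resolve_embedder(
    code_arg: Optional[str] = None,
    cli_flag: Optional[str] = None,
    toml_value: Optional[str] = None,
) -> str:
    """Return the first value that is set, in precedence order (hypothetical helper)."""
    for candidate in (code_arg, cli_flag, toml_value):
        if candidate is not None:
            return candidate
    return DEFAULT_EMBEDDER


# A CLI flag overrides indx.toml; an explicit code argument overrides both.
print(resolve_embedder(cli_flag="e5", toml_value="bge-m3"))
print(resolve_embedder(code_arg="cohere", cli_flag="e5"))
print(resolve_embedder())
```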

CLI flag:

indx ./docs --out ./ai-ready --embedder e5

indx.toml:

[embed]
model = "openai:text-embedding-3-small" # any registered embedder name; use "bge-m3" for local

Code:

from indx import DirectoryPipeline

# By name string
pipeline = DirectoryPipeline(embedder="bge-m3", store="qdrant")

# Or swap later
pipeline.use(embedder="openai")

# Or pass a custom object satisfying the Embedder protocol
class MyEmbedder:
    dim = 768

    def embed(self, texts: list[str]) -> list[list[float]]:
        ...

pipeline.use(embedder=MyEmbedder())

For authoring your own embedder backend, see Custom components and Adding a backend.

Embedding is batched, which is the single biggest performance lever for this stage. Chunk texts are grouped into batches (default 64) and submitted to Embedder.embed(list[str]); the resulting vectors are then written to the store with batched upsert calls.

| Param | Default |
| --- | --- |
| Embed batch size | 64 |
| Embed max concurrency | --jobs |

Local models are far more efficient on batches (CPU/GPU vectorization); API embedders amortize round-trips the same way and use a bounded concurrency limit to respect rate limits. Tune batch size via the embedder’s adapter sub-table or kwargs — see Performance.
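
To make the batching step concrete, here is a minimal sequential sketch. The helper names batched and embed_in_batches are hypothetical, and the sketch deliberately leaves out the bounded concurrency and the batched store upserts that the real stage performs.

```python
from typing import Iterator, Sequence


def batched(texts: Sequence[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` texts (hypothetical helper)."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(texts), size):
        yield list(texts[start : start + size])


def embed_in_batches(embedder, texts: Sequence[str], batch_size: int = 64) -> list[list[float]]:
    """Call embedder.embed once per batch and keep vectors in input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embedder.embed(batch))
    return vectors
```

With 150 chunks and the default batch size of 64, this makes three embed calls of 64, 64, and 22 texts instead of 150 single-text calls.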

When stage 06 seals the .indx archive, the embedder’s identity is written into two places:

manifest.json at the archive root:

{
  "embedder": { "name": "bge-m3", "dim": 1024 },
  "store": "qdrant"
}

and a dedicated embeddings/manifest.json alongside the raw vector matrix:

handbook.indx
├── manifest.json # embedder name + dim, checksums, counts
├── index.json
├── chunks/
└── embeddings/
    ├── manifest.json # model, dim, count, backend
    └── vectors.f32 # contiguous little-endian float32 matrix (count × dim)

This makes the archive self-describing: a consumer knows exactly which model produced the vectors. Vectors are stored as a contiguous little-endian float32 matrix of shape count × dim.
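
A consumer-side sketch of how these pinned values might be used: check that a query-time embedder matches the manifest, then decode vectors.f32. The function names are hypothetical, the manifest shape follows the manifest.json example above, and row-major ordering (one vector after another) is an assumption.

```python
import struct
from pathlib import Path


def check_compatible(manifest: dict, name: str, dim: int) -> None:
    """Raise if the query-time embedder differs from the one pinned in the manifest."""
    pinned = manifest["embedder"]
    if pinned["name"] != name or pinned["dim"] != dim:
        raise ValueError(
            f"archive was built with {pinned['name']} (dim {pinned['dim']}), "
            f"but the query embedder is {name} (dim {dim})"
        )


def read_vectors(path: str, count: int, dim: int) -> list[list[float]]:
    """Decode a little-endian float32 matrix, assuming row-major layout."""
    raw = Path(path).read_bytes()
    expected = count * dim * 4  # 4 bytes per float32
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes, found {len(raw)}")
    flat = struct.unpack(f"<{count * dim}f", raw)
    return [list(flat[row * dim : (row + 1) * dim]) for row in range(count)]
```

The byte-length check is worth keeping: a count or dim that disagrees with the file size usually means the archive and the manifest being read do not belong together.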

Vectors from one model are not comparable with vectors from another, so changing the embedder requires a full re-embed. This is reflected in the cache and resume behavior:

  • With --resume, changing the embedder invalidates only the Embed stage — Walk, Parse, Chunk, Relate, and Enrich outputs are reused from cache. Changing the parser, by contrast, invalidates Parse and everything downstream.
  • The resolved config snapshot (including the embedder name) is recorded in index.json.metadata and the manifest for auditability.
# Switch embedders; everything upstream is reused, only vectors are recomputed.
indx ./docs --out ./ai-ready --embedder e5 --resume
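
One way to picture this invalidation behavior is a chain of cache keys, where each stage's key hashes its own config together with the previous stage's key. This is an illustrative sketch only, not indx's actual cache implementation; the stage list follows the names used above, and stage_keys, invalidated, and the config layout are hypothetical.

```python
import hashlib
import json

# Stage order as named in this guide.
STAGES = ["walk", "parse", "chunk", "relate", "enrich", "embed"]


def stage_keys(config: dict) -> dict:
    """Derive one cache key per stage; each key depends on all upstream stages."""
    keys = {}
    previous = ""
    for stage in STAGES:
        payload = json.dumps(config.get(stage, {}), sort_keys=True)
        previous = hashlib.sha256((previous + stage + payload).encode("utf-8")).hexdigest()
        keys[stage] = previous
    return keys


def invalidated(old: dict, new: dict) -> list[str]:
    """List the stages whose cache key changed between two configs."""
    old_keys, new_keys = stage_keys(old), stage_keys(new)
    return [stage for stage in STAGES if old_keys[stage] != new_keys[stage]]


base = {"parse": {"parser": "docling"}, "embed": {"model": "bge-m3"}}
print(invalidated(base, {**base, "embed": {"model": "e5"}}))
print(invalidated(base, {**base, "parse": {"parser": "other"}}))
```

Under this scheme a change to the last stage leaves every upstream key intact, while a change to an early stage ripples through everything after it, which matches the embedder and parser cases described above.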