Choosing an Embedder

The embedder turns chunk text into vectors during stage 06, Embed+Pack. It is what makes a knowledge space searchable, and its identity is recorded in the archive so consumers always know which model produced the vectors. This guide helps you pick one and explains what happens when you change it.

TL;DR

Default: openai:text-embedding-3-small (cloud-backed, light to install, dim 1536).
Fully local: bge-m3 (dim 1024).
Lighter local English: e5.
No local GPU / already paying for an API: openai or cohere.
Local embedders are the heaviest optional path (they pull Torch); API embedders stay light.
The model identity (name) and dim are pinned into the archive manifest. Changing the embedder requires a full re-embed.

The Embedder protocol

Every embedder — built-in or third-party — satisfies the same typed protocol, so the pipeline never needs to know which one is active:

from typing import Protocol, runtime_checkable

@runtime_checkable
class Embedder(Protocol):
    """Turns text into vectors. Default: openai:text-embedding-3-small."""
    dim: int
    def embed(self, texts: list[str]) -> list[list[float]]: ...

Two things matter for selection:

embed(texts) takes a list of strings and returns one vector (list[float]) per input. indx always calls it in batches (see Batching).
dim is the vector dimensionality. It is read once and pinned into the archive so query-time compatibility can be validated.
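As an illustration of what satisfying the protocol looks like, here is a minimal sketch. It redefines the protocol locally so it runs on its own, and HashEmbedder is a hypothetical toy class (deterministic but not semantically meaningful), not something indx ships.

```python
import hashlib
from typing import Protocol, runtime_checkable


@runtime_checkable
class Embedder(Protocol):
    """Local copy of the protocol shown above, so this sketch is self-contained."""
    dim: int

    def embed(self, texts: list[str]) -> list[list[float]]: ...


class HashEmbedder:
    """Hypothetical toy embedder: stable output, no semantic meaning."""
    dim = 8

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Map the first `dim` bytes of the digest to floats in [0, 1].
            vectors.append([byte / 255.0 for byte in digest[: self.dim]])
        return vectors


embedder = HashEmbedder()
vectors = embedder.embed(["hello", "world"])
# The runtime check only confirms that `dim` and `embed` exist; it does not check types.
print(isinstance(embedder, Embedder), len(vectors), len(vectors[0]))
```

Note that HashEmbedder never inherits from Embedder: the protocol is structural, so any object exposing dim and embed qualifies.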

The options

Embedder	Runs	Strengths	Best for	Extra
`openai:text-embedding-3-small` (default)	API	No local GPU, no model download, dim 1536	Cloud-backed default and lightweight installs	`indx[openai]`
`bge-m3`	Local	Multilingual, long inputs, strong open-license retrieval, dim 1024	Local / air-gapped profile; mixed-language and document-heavy corpora	`indx[bge]` / `indx[embeddings-local]` (pulls Torch)
`e5`	Local	Lighter than BGE-M3, strong English retrieval	English-only corpora where you want a smaller local footprint	`indx[e5]` / `indx[embeddings-local]` (pulls Torch)
`openai`	API	No local GPU, no model download, managed quality	Teams already on OpenAI, or machines without a GPU	`indx[openai]` (light, HTTP only)
`cohere`	API	No local GPU, strong multilingual API models	Teams already on Cohere	`indx[cohere]` (light, HTTP only)

Why BGE-M3 remains the local default

BGE-M3 preserves indx’s local-first principle while being a genuinely strong general embedder:

Fully local — no API key, works air-gapped out of the box (see Local & air-gapped).
Multilingual and supports long inputs, which suits arbitrary directory contents (code, docs, mixed languages).
Strong retrieval quality among openly licensed models, with native dense embeddings (and hybrid/multi-vector modes available) — a good default without per-corpus tuning.
Dim 1024, which is what space.stats.embed_dim and the manifest report on a bge-m3 build (a build with the cloud default reports 1536).

When to switch

Choose e5 if your corpus is English-only and you want a lighter local model than BGE-M3.
Choose openai or cohere if the machine running the build has no GPU (or you don’t want to download model weights), or your team already pays for those APIs. API embedders avoid the Torch dependency entirely.

Local vs. API: the dependency story

This is the single biggest practical difference between the options.

Local embedders (bge-m3, e5) load through FlagEmbedding (BGE’s reference implementation) or sentence-transformers. They are installed via indx[bge] / indx[e5] (or the umbrella indx[embeddings-local]) and pull in Torch plus model weights, which makes this the heaviest optional path in the whole project.
API embedders (openai, cohere) just make HTTP calls. Their extras (indx[openai], indx[cohere]) stay light — no Torch, no weights.

# Recommended local default stack (docling + local embeddings + qdrant):
pip install "indx[local]"

# Just the local embedder runtime (Torch comes with it):
pip install "indx[embeddings-local]"

# Light API embedders — no Torch:
pip install "indx[openai]"
pip install "indx[cohere]"

If you select an embedder whose extra is not installed, indx raises a clear MissingDependencyError naming the exact pip install "indx[...]" to run. See the full extras matrix.
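If you want to check an environment before starting a build, a pre-flight probe along these lines may help. This is a sketch only: it does not reproduce indx's MissingDependencyError, and the mapping from embedder name to importable module is an assumption made for illustration.

```python
import importlib.util


def missing_modules(required: list[str]) -> list[str]:
    """Return the top-level module names in `required` that cannot be found."""
    return [name for name in required if importlib.util.find_spec(name) is None]


# Assumed mapping for illustration; the modules each extra really installs may differ.
ASSUMED_REQUIREMENTS = {
    "bge-m3": ["torch"],
    "e5": ["torch"],
    "openai": ["openai"],
    "cohere": ["cohere"],
}

for embedder_name, modules in ASSUMED_REQUIREMENTS.items():
    absent = missing_modules(modules)
    status = "ready" if not absent else f"missing {', '.join(absent)}"
    print(f"{embedder_name}: {status}")
```

find_spec locates a module without importing it, so the probe stays cheap even when Torch is installed.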

Selecting an embedder

The embedder slot is resolved with the standard precedence: explicit code argument / use() → CLI flag → indx.toml → documented default.
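The precedence chain amounts to "first value that is set wins". The sketch below illustrates that rule; resolve_embedder and its parameter names are hypothetical and are not indx's actual resolver.

```python
from typing import Optional

DEFAULT_EMBEDDER = "openai:text-embedding-3-small"


def resolve_embedder(
    code_arg: Optional[str] = None,
    cli_flag: Optional[str] = None,
    toml_value: Optional[str] = None,
) -> str:
    """Return the first value that is set, checked in precedence order."""
    for candidate in (code_arg, cli_flag, toml_value):
        if candidate is not None:
            return candidate
    return DEFAULT_EMBEDDER


print(resolve_embedder(cli_flag="e5", toml_value="bge-m3"))  # CLI flag beats indx.toml
print(resolve_embedder(toml_value="bge-m3"))                 # indx.toml beats the default
print(resolve_embedder())                                    # nothing set: documented default
```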

CLI

indx ./docs --out ./ai-ready --embedder e5

`indx.toml`

[embed]
model = "openai:text-embedding-3-small" # any registered embedder name; use "bge-m3" for local

SDK — by name or by object

from indx import DirectoryPipeline

# By name string
pipeline = DirectoryPipeline(embedder="bge-m3", store="qdrant")

# Or swap later
pipeline.use(embedder="openai")

# Or pass a custom object satisfying the Embedder protocol
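class MyEmbedder:
    dim = 768
    def embed(self, texts: list[str]) -> list[list[float]]:
        # Placeholder: one zero vector per text. Call your own model here instead.
        return [[0.0] * self.dim for _ in texts]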

pipeline.use(embedder=MyEmbedder())

For authoring your own embedder backend, see Custom components and Adding a backend.

Batching

Embedding is batched, which is the single biggest performance lever for this stage. Chunk texts are grouped into batches (default 64) and submitted to Embedder.embed(list[str]); the resulting vectors are then written to the store with batched upsert calls.

Param	Default
Embed batch size	64
Embed max concurrency	`--jobs`

Local models are far more efficient on batches (CPU/GPU vectorization); API embedders amortize round-trips the same way and use a bounded concurrency limit to respect rate limits. Tune batch size via the embedder’s adapter sub-table or kwargs — see Performance.
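The batching and bounded-concurrency pattern described here can be sketched as follows. The helper names (batched, embed_all, fake_embed) are hypothetical, and the thread pool is one plausible way to cap in-flight requests, not necessarily how indx schedules them.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator

EmbedFn = Callable[[list[str]], list[list[float]]]


def batched(texts: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `batch_size` texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]


def embed_all(
    embed: EmbedFn, texts: list[str], batch_size: int = 64, jobs: int = 1
) -> list[list[float]]:
    """Embed `texts` batch by batch, with at most `jobs` batches in flight."""
    batches = list(batched(texts, batch_size))
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Executor.map yields results in input order, so vectors stay aligned with texts.
        results = pool.map(embed, batches)
        return [vector for batch_vectors in results for vector in batch_vectors]


def fake_embed(texts: list[str]) -> list[list[float]]:
    """Stand-in embedder: one 1-dimensional vector per text (its length)."""
    return [[float(len(text))] for text in texts]


texts = [f"chunk {i}" for i in range(150)]
vectors = embed_all(fake_embed, texts, batch_size=64, jobs=4)
print([len(batch) for batch in batched(texts, 64)], len(vectors))
```

With 150 chunks and a batch size of 64, the embedder is called three times (64, 64, and 22 texts) rather than 150 times.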

Model identity is pinned into the archive

When stage 06 seals the .indx archive, the embedder’s identity is written into two places:

manifest.json at the archive root:

{
  "embedder": { "name": "bge-m3", "dim": 1024 },
  "store": "qdrant"
}

and a dedicated embeddings/manifest.json alongside the raw vector matrix:

handbook.indx
├── manifest.json            # embedder name + dim, checksums, counts
├── index.json
├── chunks/
└── embeddings/
    ├── manifest.json        # model, dim, count, backend
    └── vectors.f32          # contiguous little-endian float32 matrix (count × dim)

This makes the archive self-describing: a consumer knows exactly which model produced the vectors. Vectors are stored as a contiguous little-endian float32 matrix of shape count × dim, so a reader needs only count and dim from the manifest to decode vectors.f32.
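A consumer could use the pinned identity roughly as sketched below: check the model and dim first, then decode the matrix. The manifest key names ("model", "dim", "count") and the assumption that vectors.f32 has no header are guesses based on the layout above; check them against a real archive before relying on this.

```python
import json
import struct
import tempfile
from pathlib import Path


def read_vectors(embeddings_dir: Path, expected_name: str, expected_dim: int) -> list[list[float]]:
    """Check the pinned identity, then decode vectors.f32 into rows of floats."""
    manifest = json.loads((embeddings_dir / "manifest.json").read_text())
    # Assumed key names; the real manifest schema may differ.
    if (manifest["model"], manifest["dim"]) != (expected_name, expected_dim):
        raise ValueError(f"archive was embedded with {manifest['model']} (dim {manifest['dim']})")
    count, dim = manifest["count"], manifest["dim"]
    raw = (embeddings_dir / "vectors.f32").read_bytes()
    if len(raw) != count * dim * 4:  # float32 takes 4 bytes per value
        raise ValueError("vectors.f32 size does not match count × dim")
    flat = struct.unpack(f"<{count * dim}f", raw)  # "<" selects little-endian
    return [list(flat[row * dim : (row + 1) * dim]) for row in range(count)]


# Round-trip demo on a tiny synthetic embeddings directory.
with tempfile.TemporaryDirectory() as tmp:
    emb_dir = Path(tmp)
    rows = [[0.5, -1.0, 2.0], [0.25, 0.0, 4.0]]
    (emb_dir / "manifest.json").write_text(json.dumps({"model": "toy", "dim": 3, "count": 2}))
    (emb_dir / "vectors.f32").write_bytes(struct.pack("<6f", *[v for row in rows for v in row]))
    loaded = read_vectors(emb_dir, expected_name="toy", expected_dim=3)
    print(loaded == rows)
```

The demo values are exactly representable in float32, which is why the round trip compares equal; real embeddings read this way come back as Python floats widened from float32.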

Changing the embedder means re-embedding

Vectors from one model are not comparable with vectors from another, so changing the embedder requires a full re-embed. This is reflected in the cache and resume behavior:

With --resume, changing the embedder invalidates only the Embed stage — Walk, Parse, Chunk, Relate, and Enrich outputs are reused from cache. Changing the parser, by contrast, invalidates Parse and everything downstream.
The resolved config snapshot (including the embedder name) is recorded in index.json.metadata and the manifest for auditability.

# Switch embedders; everything upstream is reused, only vectors are recomputed.
indx ./docs --out ./ai-ready --embedder e5 --resume
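
One common way to get this kind of selective invalidation is to chain cache keys, so each stage's key depends on its own config and on the key of the stage before it. The sketch below illustrates that idea only; the stage names follow the list above, but the config keys and the hashing scheme are assumptions, not indx's actual cache format.

```python
import hashlib
import json

STAGES = ["walk", "parse", "chunk", "relate", "enrich", "embed"]


def stage_keys(config: dict[str, dict]) -> dict[str, str]:
    """Derive one cache key per stage from its own config plus the upstream key."""
    keys: dict[str, str] = {}
    upstream = ""
    for stage in STAGES:
        payload = json.dumps(config.get(stage, {}), sort_keys=True)
        upstream = hashlib.sha256((upstream + payload).encode("utf-8")).hexdigest()
        keys[stage] = upstream
    return keys


def invalidated(old: dict[str, dict], new: dict[str, dict]) -> list[str]:
    """List the stages whose cache key differs between two configs."""
    before, after = stage_keys(old), stage_keys(new)
    return [stage for stage in STAGES if before[stage] != after[stage]]


# Hypothetical config shape, for illustration only.
base = {"parse": {"parser": "docling"}, "embed": {"model": "bge-m3"}}
print(invalidated(base, {**base, "embed": {"model": "e5"}}))              # only the last stage
print(invalidated(base, {**base, "parse": {"parser": "other-parser"}}))  # parse and everything after
```

Because Embed is the last stage in the chain, a new embedder changes only its key, while a new parser changes the key of Parse and of every stage downstream.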

See also

Embed+Pack stage: what stage 06 does end to end.
The .indx archive: full archive layout and manifest fields.
Extras reference: every pip install indx[...] option.
Choosing a store: where the vectors land.
Bring your own stack: how slots and protocols fit together.