Choosing a Vector Store
The Store slot is where your chunk embeddings live and where space.search(...) and indx query run their nearest-neighbour lookups. indx ships five backends behind one typed protocol, so picking a store is a one-line config change — never a rewrite. The default, qdrant, gives you a real ANN index that runs embedded or as a server; the zero-dependency jsonl store is the floor that guarantees a fully local, fully portable run with nothing installed.
This page helps you choose. For the embedding model that fills the store, see Choosing an Embedder; for running everything offline, see Local & Air-Gapped.
The decision at a glance
| Store | Choose it when | Extra |
|---|---|---|
| `qdrant` (default) | General use. You want a real approximate-nearest-neighbour (ANN) index that works both embedded / on-disk and as a server with the same client code, plus rich payload filtering. Apache-2.0. | `indx[qdrant]` |
| `pgvector` | Your org already runs PostgreSQL and wants vectors sitting beside relational data on one operational surface. | `indx[pgvector]` |
| `chroma` | Quick local prototyping with a simple, popular API. | `indx[chroma]` |
| `lancedb` | Large, file-based columnar vector data with zero server and good on-disk performance. | `indx[lancedb]` |
| `jsonl` (no DB) | Air-gapped / fully portable / zero-dependency indexes. Brute-force linear scan over small corpora; maximum portability inside the `.indx` archive. | none (ships in core) |
How a Store fits the pipeline
A Store is consumed in stage 06 Embed+Pack (see the pipeline overview). The Embedder turns chunk text into vectors, those vectors are upserted into the store, and then the store is persisted into the archive’s embeddings/ layout so the resulting .indx stays portable regardless of which backend produced it.
The contract is the Store protocol (the normative names are from the protocols reference):
```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Store(Protocol):
    """Vector database adapter. Default: qdrant. Also: pgvector, chroma, lancedb, jsonl."""

    def upsert(self, ids: list[str], vectors: list[list[float]], payloads: list[dict]) -> None: ...

    def query(self, vector: list[float], k: int = 5, filter: dict | None = None) -> list[tuple[str, float]]: ...

    def persist(self, dest: str) -> None:
        """Flush/export the store into the output `embeddings/` layout."""
```

- `upsert` writes batches of `(id, vector, payload)` records. indx batches chunk vectors (default batch size 64) before calling it.
- `query` returns the top-k `(chunk_id, score)` pairs for a query vector, with optional payload `filter`. This is what powers `space.search(...)` and `indx query`.
- `persist` is the portability guarantee; see below.
Because the slot is a structural Protocol, any third-party class that implements these three methods works as a store with no subclassing and no fork of indx. See Custom Components and Authoring a Plugin.
Persistence: why your archive stays portable
What keeps a .indx archive openable on any machine, even one without your database, is persist(). When stage 06 seals the archive, it calls Store.persist(dest) to materialise the embeddings into the standard embeddings/ layout:

```
handbook.indx (ZIP container)
├── manifest.json        # counts, embedder name + dim, store name, checksums
├── index.json           # the knowledge graph
├── chunks/              # per-chunk JSON
└── embeddings/
    ├── manifest.json    # model, dim, count, backend
    └── vectors.f32      # contiguous little-endian float32 matrix (count × dim)
```

So no matter which backend you used during the build, the vectors travel inside the archive as a float32 matrix, and the embedder name/dim are pinned in the manifest. A consumer that does KnowledgeSpace.load(...) can search the space without ever talking to Qdrant or Postgres.
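To make the layout concrete, here is a hedged, standard-library-only sketch of writing and reading a `vectors.f32` file as a little-endian float32 matrix. It assumes the archive has already been unzipped to a directory and that the embeddings manifest uses the literal keys `dim` and `count`; the source lists those fields but not their exact key names, and `write_embeddings` / `read_embeddings` are hypothetical helpers, not indx APIs.

```python
import json
import struct
import tempfile
from pathlib import Path


def write_embeddings(dest: Path, vectors: list[list[float]]) -> None:
    """Write a toy embeddings/ directory in the layout described above."""
    dest.mkdir(parents=True, exist_ok=True)
    dim = len(vectors[0])
    manifest = {"model": "toy", "dim": dim, "count": len(vectors), "backend": "jsonl"}
    (dest / "manifest.json").write_text(json.dumps(manifest), encoding="utf-8")
    flat = [x for row in vectors for x in row]
    # "<" = little-endian, "f" = 4-byte float32, rows laid out back to back.
    (dest / "vectors.f32").write_bytes(struct.pack(f"<{len(flat)}f", *flat))


def read_embeddings(src: Path) -> list[list[float]]:
    """Read the matrix back, checking the file size against count × dim."""
    manifest = json.loads((src / "manifest.json").read_text(encoding="utf-8"))
    count, dim = manifest["count"], manifest["dim"]
    raw = (src / "vectors.f32").read_bytes()
    if len(raw) != count * dim * 4:
        raise ValueError(f"expected {count * dim * 4} bytes, found {len(raw)}")
    flat = struct.unpack(f"<{count * dim}f", raw)
    return [list(flat[i * dim:(i + 1) * dim]) for i in range(count)]


with tempfile.TemporaryDirectory() as tmp:
    emb_dir = Path(tmp) / "embeddings"
    # Values chosen to be exactly representable in float32.
    write_embeddings(emb_dir, [[0.5, -1.0], [2.0, 0.25]])
    matrix = read_embeddings(emb_dir)
```

The size check is the cheap part of dimension safety: if the manifest and the binary file disagree, the reader fails before any query runs.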
Choosing a backend
Qdrant (default)
The all-rounder. Qdrant gives you a genuine ANN index with rich payload filtering, strong performance, and an Apache-2.0 license, so there is no lock-in. Its standout property for indx is that the same client runs embedded (in-process / local on-disk path) and against a remote server, so you can start local and graduate to a hosted deployment without touching your code. Pick Qdrant unless you have a specific reason not to.
pgvector
The pragmatic choice for teams already standardised on PostgreSQL. Keeping vectors in Postgres means one backup story, one access-control model, and the ability to JOIN vector results against your existing relational data. Choose it to consolidate operational surface area rather than running a second datastore.
Chroma
A simple, popular API that is great for quick local prototyping. Reach for Chroma when you want to spin something up fast and are not yet worried about scale or production operations.
LanceDB
A file-based, columnar vector store with no server and good on-disk performance. LanceDB shines when you have a large vector set you want to keep as files (easy to copy, version, and stage) without standing up a service.
JSONL (no DB)
The zero-dependency floor. The jsonl store ships in the core install, requires no database, and stores vectors inline in the archive, so indx <dir> --out <dir> produces a usable knowledge space with nothing else installed. It searches by brute-force linear scan, which is fine for small spaces and unbeatable for maximum portability and air-gapped delivery. It is not an ANN index, however, so it does not scale to large corpora.
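For intuition about why a linear scan stops scaling, the sketch below scores a query against every stored vector and keeps the top k. It is not the jsonl store's actual implementation: the cosine metric and the `linear_scan` helper are assumptions chosen for illustration.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def linear_scan(query: list[float], rows: dict[str, list[float]], k: int = 5) -> list[tuple[str, float]]:
    """Score every row, then keep the k best (chunk_id, score) pairs."""
    scored = ((chunk_id, cosine(query, vec)) for chunk_id, vec in rows.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


rows = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [1.0, 1.0]}
hits = linear_scan([1.0, 0.0], rows, k=2)
```

Every query touches all stored vectors, so the work grows with the number of chunks times the embedding dimension. An ANN index avoids that by visiting only a fraction of the vectors, at the cost of approximate results.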
Selecting your store
You can pick a store three ways. Precedence, highest first: explicit code argument or use() → CLI flag → indx.toml → documented default.
On the CLI
```bash
# Default Qdrant (embedded/on-disk)
indx ./docs --out ./ai-ready

# Zero-dependency, fully portable
indx ./docs --out ./ai-ready --store jsonl

# pgvector
indx ./docs --out ./ai-ready --store pgvector
```

In indx.toml
The store slot lives in the [store] section. Backend-specific options go in a [store.<backend>] sub-table; those keys are passed verbatim to the adapter constructor and are otherwise opaque to the core.

```toml
[store]
backend = "qdrant"   # qdrant | pgvector | chroma | lancedb | jsonl

[store.qdrant]
url = "http://localhost:6333"   # adapter-specific; omit to run embedded/on-disk
```

```toml
[store]
backend = "pgvector"

[store.pgvector]
dsn = "postgresql://localhost:5432/indx"   # connection target; the password comes from env, not the file
# Override or supply the secret via env: INDX_STORE__PGVECTOR__DSN="postgresql://user:pass@host:5432/indx"
```

In Python
Pass a name string or a custom instance to the pipeline, or swap it later with use():

```python
from indx import DirectoryPipeline

# By name
pipeline = DirectoryPipeline(embedder="bge-m3", store="lancedb")
space = pipeline.run("./docs", "./ai-ready")

# Swap by keyword later (accepts a name or an instance)
pipeline.use(store="jsonl")
```

Every backend is an extra
The core install (pip install indx) depends on no database client; only the jsonl store ships in core. Each real database backend is an optional extra, so the core stays light and a bare install still runs fully offline:

```bash
pip install "indx[qdrant]"     # QdrantStore   → qdrant-client
pip install "indx[pgvector]"   # PgVectorStore → psycopg + pgvector
pip install "indx[chroma]"     # ChromaStore   → chromadb
pip install "indx[lancedb]"    # LanceDBStore  → lancedb
```

If you select a backend whose extra is not installed, indx raises a single actionable error (MissingDependencyError) naming the exact pip install indx[...] to run. The error is raised only when that slot is actually selected, so unrelated runs are never affected. The recommended local / air-gapped bundle is pip install "indx[defaults]" (Docling + Ollama + BGE-M3 + Qdrant). See the extras reference for the full matrix.
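The "only when selected" behaviour is typically achieved with a lazy import. The sketch below shows one way that pattern could look. `MissingDependencyError` is redefined locally as a stand-in, and `require_backend` and the backend table are hypothetical; indx's real class and registry may differ.

```python
import importlib


class MissingDependencyError(ImportError):
    """Local stand-in for the error named above; indx's real class may differ."""


# Hypothetical mapping: store name -> (module to import, extra to suggest).
BACKENDS = {
    "qdrant": ("qdrant_client", "qdrant"),
    "chroma": ("chromadb", "chroma"),
    "lancedb": ("lancedb", "lancedb"),
}


def require_backend(name: str, backends: dict[str, tuple[str, str]] = BACKENDS):
    """Import a backend's client module only when that store is selected."""
    if name == "jsonl":
        return None  # ships in core, nothing to import
    module_name, extra = backends[name]
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise MissingDependencyError(
            f'Store "{name}" needs an optional dependency. Run: pip install "indx[{extra}]"'
        ) from exc


# Demonstrate the failure path with a module name that cannot exist.
try:
    require_backend("demo", {"demo": ("indx_no_such_module_xyz", "demo")})
except MissingDependencyError as err:
    message = str(err)
```

Because the import happens inside the function rather than at module load, a run that selects `jsonl` never attempts to import any database client.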
Things to keep in mind
- Dimension safety. The embedder’s `name` and `dim` (1024 for local `bge-m3`, 1536 for the default OpenAI embedder) are recorded in the archive manifest, so a loaded archive can detect a vector/dimension mismatch before querying. Changing the embedder means re-embedding.
- Batching is the lever. Upserts are batched (default 64) to amortise round-trips; tune batch size and concurrency per backend for throughput. See Performance.
- Portability is guaranteed regardless of backend. Whichever store you build with, `persist()` materialises the same `embeddings/` layout, so a teammate can open the `.indx` with `jsonl`-style inline search even if they do not run your database. See the .indx archive reference.
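The first two points can be sketched together: check each vector's length against the expected dimension, then hand the records to `upsert` in fixed-size slices. `upsert_in_batches` and `RecordingStore` are hypothetical names used only for this illustration; the batch size of 64 mirrors the default mentioned above.

```python
def upsert_in_batches(store, ids, vectors, payloads, expected_dim: int, batch_size: int = 64) -> int:
    """Validate dimensions, then upsert in slices. Returns the number of upsert calls."""
    for chunk_id, vec in zip(ids, vectors):
        if len(vec) != expected_dim:
            raise ValueError(f"{chunk_id}: expected dim {expected_dim}, got {len(vec)}")
    calls = 0
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        store.upsert(ids[start:end], vectors[start:end], payloads[start:end])
        calls += 1
    return calls


class RecordingStore:
    """Test double that records how many records arrive in each upsert call."""

    def __init__(self) -> None:
        self.batch_sizes: list[int] = []

    def upsert(self, ids, vectors, payloads) -> None:
        self.batch_sizes.append(len(ids))


store = RecordingStore()
n = 150
calls = upsert_in_batches(
    store,
    ids=[f"c{i}" for i in range(n)],
    vectors=[[0.0] * 4 for _ in range(n)],
    payloads=[{} for _ in range(n)],
    expected_dim=4,
)
```

With 150 records and a batch size of 64, the store receives three calls of 64, 64, and 22 records. For a networked backend each call is one round-trip, which is why batch size is a throughput lever.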
See also
- Local & Air-Gapped: building and querying with zero network or external services.
- Inspect & Query: reading stats and running searches against a sealed archive.
- Protocols reference: the full `Store` contract and the other slot protocols.
- Registry & Defaults: how store names resolve to classes.
- Extras reference: the complete `pip install indx[...]` matrix.