Choosing a Vector Store

The Store slot is where your chunk embeddings live and where space.search(...) and indx query run their nearest-neighbour lookups. indx ships five backends behind one typed protocol, so picking a store is a one-line config change — never a rewrite. The default, qdrant, gives you a real ANN index that runs embedded or as a server; the zero-dependency jsonl store is the floor that guarantees a fully local, fully portable run with nothing installed.

This page helps you choose. For the embedding model that fills the store, see Choosing an Embedder; for running everything offline, see Local & Air-Gapped.

| Store | Choose it when | Extra |
| --- | --- | --- |
| qdrant (default) | General use. You want a real approximate-nearest-neighbour (ANN) index that works both embedded / on-disk and as a server with the same client code, plus rich payload filtering. Apache-2.0. | indx[qdrant] |
| pgvector | Your org already runs PostgreSQL and wants vectors sitting beside relational data on one operational surface. | indx[pgvector] |
| chroma | Quick local prototyping with a simple, popular API. | indx[chroma] |
| lancedb | Large, file-based columnar vector data with zero server and good on-disk performance. | indx[lancedb] |
| jsonl (no DB) | Air-gapped / fully portable / zero-dependency indexes. Brute-force linear scan over small corpora; maximum portability inside the .indx archive. | none (ships in core) |

A Store is consumed in stage 06 Embed+Pack (see the pipeline overview). The Embedder turns chunk text into vectors, those vectors are upserted into the store, and then the store is persisted into the archive’s embeddings/ layout so the resulting .indx stays portable regardless of which backend produced it.

The contract is the Store protocol; the protocols reference is the normative source for these names:

from typing import Protocol, runtime_checkable

@runtime_checkable
class Store(Protocol):
    """Vector database adapter. Default: qdrant.
    Also: pgvector, chroma, lancedb, jsonl."""

    def upsert(self, ids: list[str], vectors: list[list[float]],
               payloads: list[dict]) -> None: ...

    def query(self, vector: list[float], k: int = 5,
              filter: dict | None = None) -> list[tuple[str, float]]: ...

    def persist(self, dest: str) -> None:
        """Flush/export the store into the output `embeddings/` layout."""

  • upsert writes batches of (id, vector, payload) records. indx batches chunk vectors (default batch size 64) before calling it.
  • query returns the top-k (chunk_id, score) pairs for a query vector, with optional payload filter. This is what powers space.search(...) and indx query.
  • persist is the portability guarantee — see below.

Because the slot is a structural Protocol, any third-party class that implements these three methods works as a store with no subclassing and no fork of indx. See Custom Components and Authoring a Plugin.

Persistence: why your archive stays portable

The thing that keeps a .indx archive openable on any machine — even one without your database — is persist(). When stage 06 seals the archive, it calls Store.persist(dest) to materialise the embeddings into the standard embeddings/ layout:

handbook.indx (ZIP container)
├── manifest.json          # counts, embedder name + dim, store name, checksums
├── index.json             # the knowledge graph
├── chunks/                # per-chunk JSON
└── embeddings/
    ├── manifest.json      # model, dim, count, backend
    └── vectors.f32        # contiguous little-endian float32 matrix (count × dim)

So no matter which backend you used during the build, the vectors travel inside the archive as a float32 matrix, and the embedder name/dim are pinned in the manifest. A consumer that does KnowledgeSpace.load(...) can search the space without ever talking to Qdrant or Postgres.
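As a rough illustration of why that layout is easy to consume, the sketch below encodes and decodes a contiguous little-endian float32 matrix using only the standard library. The function names are hypothetical, and row-major ordering (one vector per row) is an assumption read from "count × dim"; check the .indx archive reference before relying on it.

```python
import struct


def encode_matrix(rows: list[list[float]]) -> bytes:
    """Pack rows as a contiguous little-endian float32 matrix (assumed row-major)."""
    flat = [x for row in rows for x in row]
    return struct.pack(f"<{len(flat)}f", *flat)


def decode_matrix(raw: bytes, count: int, dim: int) -> list[list[float]]:
    """Inverse of encode_matrix; count and dim would come from embeddings/manifest.json."""
    expected = count * dim * 4  # 4 bytes per float32
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes for {count}x{dim}, got {len(raw)}")
    flat = struct.unpack(f"<{count * dim}f", raw)
    return [list(flat[i * dim:(i + 1) * dim]) for i in range(count)]


# Values chosen to be exactly representable in float32, so the round trip is lossless.
rows = [[1.0, 0.5], [-2.0, 0.25]]
raw = encode_matrix(rows)
assert decode_matrix(raw, count=2, dim=2) == rows
```

The size check matters in practice: a byte length that does not equal count × dim × 4 is the cheapest signal that the manifest and the vector file have drifted apart.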

The all-rounder. Qdrant gives you a genuine ANN index with rich payload filtering, strong performance, and an Apache-2.0 license, so there is no lock-in. Its standout property for indx is that the same client runs embedded (in-process / local on-disk path) and against a remote server — you start local and graduate to a hosted deployment without touching your code. Pick Qdrant unless you have a specific reason not to.

pgvector is the pragmatic choice for teams already standardised on PostgreSQL. Keeping vectors in Postgres means one backup story, one access-control model, and the ability to JOIN vector results against your existing relational data. Choose it to consolidate operational surface area rather than running a second datastore.

Chroma offers a simple, popular API that suits quick local prototyping. Reach for it when you want to spin something up fast and are not yet worried about scale or production operations.

A file-based, columnar vector store with no server and good on-disk performance. LanceDB shines when you have a large vector set you want to keep as files (easy to copy, version, and stage) without standing up a service.

The zero-dependency floor. The jsonl store ships in the core install, requires no database, and stores vectors inline in the archive, so indx <dir> --out <dir> produces a usable knowledge space with nothing else installed. It searches by brute-force linear scan, which is fine for small spaces and unbeatable for maximum portability and air-gapped delivery — but it is not an ANN index, so it does not scale to large corpora.

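You can pick a store three ways: on the command line, in indx.toml, or in Python. When more than one source sets it, precedence runs from highest to lowest: explicit code argument / use() → CLI flag → indx.toml → documented default.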

# Default Qdrant (embedded/on-disk)
indx ./docs --out ./ai-ready
# Zero-dependency, fully portable
indx ./docs --out ./ai-ready --store jsonl
# pgvector
indx ./docs --out ./ai-ready --store pgvector

The store slot lives in the [store] section. Backend-specific options go in a [store.<backend>] sub-table; those keys are passed verbatim to the adapter constructor and are otherwise opaque to the core.

# Example 1: Qdrant (the default)
[store]
backend = "qdrant" # qdrant | pgvector | chroma | lancedb | jsonl

[store.qdrant]
url = "http://localhost:6333" # adapter-specific; omit to run embedded/on-disk

# Example 2: pgvector (an alternative indx.toml; a single file can define [store] only once)
[store]
backend = "pgvector"

[store.pgvector]
dsn = "postgresql://localhost:5432/indx" # connection target; the password comes from env, not the file
# Override or supply the secret via env: INDX_STORE__PGVECTOR__DSN="postgresql://user:pass@host:5432/indx"
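The environment variable in the last comment suggests a naming convention: an INDX_ prefix, then the config path with double underscores between levels. The sketch below shows how such an overlay could work; the function name and the exact mapping rules (lower-casing, splitting on "__") are assumptions inferred from that single example, not a description of indx's loader.

```python
import copy


def apply_env_overrides(config: dict, environ: dict, prefix: str = "INDX_") -> dict:
    """Return a copy of config with PREFIX + SECTION__SUB__KEY variables layered on top."""
    merged = copy.deepcopy(config)  # never mutate the caller's config
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue  # unrelated variables such as PATH are ignored
        path = name[len(prefix):].lower().split("__")
        node = merged
        for part in path[:-1]:
            node = node.setdefault(part, {})  # create intermediate tables on demand
        node[path[-1]] = value
    return merged


file_config = {
    "store": {
        "backend": "pgvector",
        "pgvector": {"dsn": "postgresql://localhost:5432/indx"},
    }
}
env = {
    "INDX_STORE__PGVECTOR__DSN": "postgresql://user:pass@host:5432/indx",
    "PATH": "/usr/bin",
}
effective = apply_env_overrides(file_config, env)
assert effective["store"]["pgvector"]["dsn"].startswith("postgresql://user:pass@")
```

In real use the second argument would be os.environ; a plain dict is used here so the example has no side effects.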

Pass a name string or a custom instance to the pipeline, or swap it later with use():

from indx import DirectoryPipeline
# By name
pipeline = DirectoryPipeline(embedder="bge-m3", store="lancedb")
space = pipeline.run("./docs", "./ai-ready")
# Swap by keyword later (accepts a name or an instance)
pipeline.use(store="jsonl")
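The precedence order for the three selection routes can be sketched as a first-match lookup. The function below is a hypothetical illustration of that rule, not indx's resolver; only the ordering and the qdrant default come from this page.

```python
def resolve_store(code_arg=None, cli_flag=None, toml_value=None, default="qdrant"):
    """Return the first explicitly set source, highest precedence first."""
    for candidate in (code_arg, cli_flag, toml_value):
        if candidate is not None:
            return candidate
    return default  # nothing set anywhere: fall back to the documented default


# indx.toml says chroma, but the CLI flag outranks it.
assert resolve_store(cli_flag="jsonl", toml_value="chroma") == "jsonl"
# An explicit code argument (or use()) outranks everything else.
assert resolve_store(code_arg="lancedb", cli_flag="jsonl", toml_value="chroma") == "lancedb"
```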

The core install (pip install indx) depends on no database client — only the jsonl store ships in core. Each real database backend is an optional extra, so the core stays light and a bare install still runs fully offline:

pip install "indx[qdrant]" # QdrantStore → qdrant-client
pip install "indx[pgvector]" # PgVectorStore → psycopg + pgvector
pip install "indx[chroma]" # ChromaStore → chromadb
pip install "indx[lancedb]" # LanceDBStore → lancedb

If you select a backend whose extra is not installed, indx raises a single actionable error (MissingDependencyError) naming the exact pip install indx[...] to run — and only when that slot is actually selected, so unrelated runs are never affected. The recommended local / air-gapped bundle is pip install "indx[defaults]" (Docling + Ollama + BGE-M3 + Qdrant). See the extras reference for the full matrix.
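The "fail only when the slot is selected" behaviour is typically achieved with a lazy import inside the adapter. The following is a sketch of that pattern under stated assumptions: the MissingDependencyError class here is a local stand-in, and the helper name require and the message wording are hypothetical.

```python
import importlib


class MissingDependencyError(ImportError):
    """Stand-in for the error named above; indx's real class may differ."""


def require(module: str, extra: str):
    """Import a module lazily, or fail with the install command for its extra."""
    try:
        return importlib.import_module(module)
    except ModuleNotFoundError as exc:
        raise MissingDependencyError(
            f'The selected backend needs "{module}". Run: pip install "indx[{extra}]"'
        ) from exc


# Nothing is imported until a backend is actually chosen, so a run that
# selects jsonl never touches (or complains about) a database client.
json_module = require("json", extra="core")  # present in every Python install
```

Because the import happens at selection time rather than at package import time, an unrelated run never pays for, or trips over, a missing optional client.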

  • Dimension safety. The embedder’s name and dim (1024 for local bge-m3, 1536 for the default OpenAI embedder) are recorded in the archive manifest, so a loaded archive can detect a vector/dimension mismatch before querying. Changing the embedder means re-embedding.
  • Batching is the lever. Upserts are batched (default 64) to amortise round-trips; tune batch size and concurrency per backend for throughput. See Performance.
  • Portability is guaranteed regardless of backend. Whichever store you build with, persist() materialises the same embeddings/ layout — so a teammate can open the .indx with jsonl-style inline search even if they do not run your database. See the .indx archive reference.
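The first two notes can be sketched together: validate vector dimensions against the value pinned in the manifest, then feed the store in fixed-size batches. Both helper names are hypothetical; the batch size of 64 and the 1024 / 1536 dimensions are the figures quoted above.

```python
from typing import Iterator, Sequence


def batched(items: Sequence, size: int = 64) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def check_dim(vectors: Sequence[Sequence[float]], expected_dim: int) -> None:
    """Raise if any vector's length differs from the dimension pinned in the manifest."""
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(f"vector {i} has dim {len(vec)}, expected {expected_dim}")


# 150 chunk vectors go out as three upsert calls instead of 150.
batch_sizes = [len(batch) for batch in batched(list(range(150)))]
assert batch_sizes == [64, 64, 22]

check_dim([[0.0] * 1024], expected_dim=1024)  # matches a bge-m3 archive: no error
```

Passing 1536-dimensional vectors to an archive pinned at 1024 would raise here, which is the early failure the dimension-safety note describes.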