Testing

indx’s test suite is built to run offline, fast, and deterministically — the same values the product itself ships with. The default suite never touches the network, a GPU, or an external service, so any contributor (and CI on every PR) can run it on a laptop with a bare pip install indx. This page is the reference for how the suite is organized and the rules new tests must follow.

The framework is pytest (with fixtures and parametrization). The standards below come from the coding standards §11, the testing technology selection §13, and the repository layout §7.

Test layout: `tests/` mirrors `src/indx/`

The tests/ tree maps one-to-one onto src/indx/. A test for src/indx/store/jsonl.py lives at the mirrored path under tests/. On top of that mirroring, tests are split into a fast offline unit/ tier and a gated integration/ tier, with shared fixtures and sample inputs alongside.

tests/
├── conftest.py            # Shared fixtures: tmp space, SpaceContext, fake clients
├── fixtures/
│   ├── sample_dir/        # Small tree to walk/parse end to end
│   ├── docs/              # Sample pdf/docx/md/png inputs
│   ├── indx_toml/         # Valid + invalid indx.toml configs
│   └── archives/          # Reference .indx archives for round-trip tests
├── unit/                  # No network, no GPU, no heavy deps — uses fakes/stubs
│   ├── core/              # models: KnowledgeSpace, Document, Chunk, Relation, context
│   ├── pipeline/          # orchestration + one test file per stage
│   ├── parsers/           # protocol test + plaintext (zero-dep) parser
│   ├── store/             # jsonl (zero-dep) store
│   ├── output/            # jsonl + .indx writers
│   ├── archive/           # write -> read round-trip vs fixtures
│   ├── config/            # schema + loader precedence
│   ├── registry/          # name -> class resolution, plugin discovery
│   ├── cli/               # Typer CliRunner: main / inspect / query
│   └── utils/             # hashing, require_extra() guidance
├── integration/           # @pytest.mark.integration — needs extras + services
│   ├── parsers/           # test_docling.py (requires indx[docling])
│   ├── llm/               # test_ollama.py (requires a running Ollama)
│   ├── embed/             # test_bge_m3.py (requires indx[bge], downloads model)
│   ├── store/             # test_qdrant.py, test_pgvector.py (require services)
│   └── test_end_to_end.py # build -> inspect -> query on the default stack
└── plugins/
    └── fake_plugin/       # Minimal installable dist for entry-point discovery

Because the swappable contracts (Parser, LLM, VLM, Embedder, Store, OutputWriter, Stage) live in each slot’s base.py, a unit test can substitute a Protocol-typed fake for any slot and never import a heavy backend. See the protocol reference for the contracts these fakes satisfy.
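As a rough illustration of the substitution described above, the sketch below defines a stand-in Embedder protocol and a fake that satisfies it without inheriting from anything. The protocol shape (`embed(texts)` returning lists of floats), the `FakeEmbedder` name, and the `dim` parameter are assumptions made for this example; the real contracts are the ones in each slot's base.py.

```python
import hashlib
from typing import Protocol, runtime_checkable


@runtime_checkable
class Embedder(Protocol):
    """Stand-in for the real contract in base.py (shape assumed for this sketch)."""

    def embed(self, texts: list[str]) -> list[list[float]]: ...


class FakeEmbedder:
    """Deterministic, offline fake: derives each vector from a hash of the text."""

    def __init__(self, dim: int = 8) -> None:
        # SHA-256 yields 32 bytes, so this sketch supports dim <= 32.
        self.dim = dim

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Map the first `dim` bytes to floats in [0, 1]; same text, same vector.
            vectors.append([b / 255 for b in digest[: self.dim]])
        return vectors


fake = FakeEmbedder(dim=4)
assert isinstance(fake, Embedder)              # structural match, no inheritance
assert fake.embed(["a"]) == fake.embed(["a"])  # stable across calls
```

A hash-based fake needs no model download and no random state, so it is safe to use in the offline unit tier.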

Unit vs integration

Aspect	Unit	Integration
Marker	(none — the default)	`@pytest.mark.integration`
Network / GPU / services	None — forbidden	Allowed (the point of the tier)
Dependencies	Core only (+ zero-dep fallbacks)	Backend extras (`indx[docling]`, `indx[qdrant]`, …)
When it runs	Every CI matrix cell, every PR	`integration.yml`, gated; skipped when the extra/service is absent
Speed	Fast, deterministic	Slower, environment-dependent

Unit tests run with no network, no GPU, and no external service. They cover the dependency-light core, the pipeline stages, and the zero-dependency fallbacks that ship in core — the plaintext parser, the jsonl store, the none VLM, and the .indx + jsonl writers — so a complete pipeline path is exercised entirely offline.

Integration tests carry @pytest.mark.integration and exercise real backends — Docling, a running Ollama, the BGE-M3 model download, a live Qdrant or Postgres+pgvector service. They are gated behind their extras/services and skip automatically when those are absent, so a contributor without GPU or Docker still gets a green unit run.

import pytest

@pytest.mark.integration
def test_qdrant_upsert_and_search(qdrant_service):
    """Round-trip against a real Qdrant. Skipped when the service is down."""
    ...

# Default: the fast, offline suite (what PR CI runs)
pytest

# Run only the integration tier (needs the extras/services installed)
pytest -m integration

# Exclude integration explicitly
pytest -m "not integration"
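The automatic skip described above could be implemented with a fixture along these lines. This is a sketch, not indx's actual conftest.py: the `service_reachable` helper is hypothetical, and the host and port are assumptions (6333 is Qdrant's default HTTP port).

```python
import socket

import pytest


def service_reachable(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


@pytest.fixture(scope="session")
def qdrant_service() -> str:
    """Yield the service URL, or skip every dependent test if it is down."""
    host, port = "localhost", 6333  # assumed defaults for this sketch
    if not service_reachable(host, port):
        pytest.skip(f"Qdrant not reachable at {host}:{port}")
    return f"http://{host}:{port}"
```

For a missing extra rather than a missing service, `pytest.importorskip("qdrant_client")` at the top of the test module achieves the same effect: the module is skipped instead of failing on import.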

Mock / VCR every network model call

Real provider calls never run in the default suite. Every LLM, VLM, and embedding network call is either replaced with a Protocol-satisfying fake or replayed from a recorded VCR-style cassette (pytest-recording / VCR). This keeps CI free of API keys and provider flakiness, and it mirrors indx’s own local-first ethos: the tests, like the tool, work air-gapped.

For LLM / VLM / Store slots, prefer a fake that satisfies the Protocol — it is a drop-in because the contract lives in base.py.
For HTTP-backed adapters (OpenAI, Anthropic, Cohere, hosted Qdrant), record a cassette once and replay it; re-record deliberately when a provider’s response shape genuinely changes.
A small set of opt-in tests, marked @pytest.mark.live, hits real backends. They run only when --live is passed, in a nightly or manually triggered job, never in the default PR check.

import pytest

@pytest.mark.live  # opt-in; excluded from the default offline run
def test_openai_embedder_live(openai_key):
    """Hits the real API. Only runs with --live."""
    ...

# Replay cassettes / fakes only (default — offline & deterministic)
pytest

# Opt in to the real-backend tests
pytest --live
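pytest has no built-in --live flag, so something in conftest.py has to define it and connect it to the live marker. How indx wires this is not shown on this page; the sketch below follows the standard pattern from the pytest documentation for skipping tests based on a command-line option.

```python
import pytest


def pytest_addoption(parser):
    # Register the opt-in flag; without it, live tests are skipped.
    parser.addoption(
        "--live",
        action="store_true",
        default=False,
        help="run tests marked 'live' against real backends",
    )


def pytest_configure(config):
    # Declare the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "live: hits a real provider; needs --live")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--live"):
        return  # flag given: leave live tests runnable
    skip_live = pytest.mark.skip(reason="needs --live to run")
    for item in items:
        if "live" in item.keywords:
            item.add_marker(skip_live)
```

Skipping at collection time means a plain `pytest` run reports the live tests as skipped rather than silently omitting them, which keeps their existence visible in CI output.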

Golden-file determinism tests

indx’s contract with its users is the serialized artifact: a byte-stable, diff-friendly index.json and .indx manifest. Golden-file tests are the determinism guardrail behind reproducibility: run the pipeline on a fixed sample directory and assert byte-for-byte equality against a committed golden file. Any unintended change to serialization — field reordering, a stray float format, a new key — fails loudly and shows up as a reviewable diff in the PR.

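from pathlib import Path
from indx import DirectoryPipeline

def test_index_json_is_byte_stable(tmp_path):
    # Relative paths assume pytest is invoked from the repository root.
    DirectoryPipeline(seed=0).run("tests/fixtures/sample_dir", tmp_path)
    # Compare raw bytes: read_text() decodes and translates newlines,
    # which could hide an encoding or line-ending difference.
    produced = (tmp_path / "index.json").read_bytes()
    golden = Path("tests/fixtures/archives/sample.index.json").read_bytes()
    assert produced == golden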

This works because the core models guarantee stable serialization: fixed field ordering, sorted keys for free-form maps, UTC ISO-8601 timestamps, fixed float representation, and a top-level schema_version on every space. See the index.json reference and the .indx archive reference for the exact shapes pinned here.
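To make two of those guarantees concrete, here is a minimal standard-library sketch of key sorting and UTC timestamp normalization. It is not indx's serializer: `dump_stable` and `utc_iso` are hypothetical names, and sorting every key is a simplification of the "fixed field ordering plus sorted free-form maps" rule described above.

```python
import json
from datetime import datetime, timezone


def dump_stable(record: dict) -> str:
    """Serialize so that equal data always yields identical text."""
    return json.dumps(
        record,
        sort_keys=True,            # output independent of dict insertion order
        ensure_ascii=False,        # keep non-ASCII text readable in diffs
        indent=2,
        separators=(",", ": "),    # pin separators explicitly
    ) + "\n"


def utc_iso(ts: datetime) -> str:
    """Normalize a timezone-aware timestamp to UTC ISO-8601, second precision."""
    return ts.astimezone(timezone.utc).isoformat(timespec="seconds")


# Same data, different insertion order: identical serialized text.
assert dump_stable({"b": 1, "a": 2}) == dump_stable({"a": 2, "b": 1})
```

Float formatting is not handled in this sketch; a real serializer would also need to pin how floats are rounded and printed.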

Protocol-conformance fixture

A shared protocol-conformance fixture asserts that every Parser, LLM, VLM, Embedder, Store, and OutputWriter actually satisfies its Protocol and round-trips core types: each accepts and returns only domain types (ParsedDoc, Chunk, Relation, …) and never leaks a vendor object. Because the protocols are structural (typing.Protocol, @runtime_checkable where useful), conformance can be checked at runtime as well as by the type checker.

import pytest
from indx.store import Store

def test_store_satisfies_protocol(store_impl):
    assert isinstance(store_impl, Store)  # structural check

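def test_store_round_trips_core_types(store_impl, sample_chunks):
    # Chunk attribute names (id, vector, payload) are assumed here;
    # adapt them to the actual core model.
    ids = [c.id for c in sample_chunks]
    vectors = [c.vector for c in sample_chunks]
    payloads = [c.payload for c in sample_chunks]
    store_impl.upsert(ids, vectors, payloads)
    hits = store_impl.query(vector=[0.0] * 1024, k=5)
    # only core SearchHit objects come back; no vendor records leak out
    assert all(hasattr(h, "chunk_id") for h in hits)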

Every new backend you contribute must add this adapter contract test — it is part of the recipe in adding a backend, alongside declaring the extra and registering the entry point.
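One caveat when writing the contract test: an isinstance check against a @runtime_checkable Protocol only verifies that the required methods exist, not their signatures or return types. That is why the round-trip test belongs next to the structural check. The sketch below demonstrates this with a hypothetical two-method Store protocol; the method signatures are assumptions for the example.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Store(Protocol):
    """Hypothetical stand-in for the real Store contract."""

    def upsert(self, ids: list[str], vectors: list[list[float]], payloads: list[dict]) -> None: ...
    def query(self, vector: list[float], k: int) -> list: ...


class GoodStore:
    def __init__(self) -> None:
        self._rows: dict = {}

    def upsert(self, ids, vectors, payloads) -> None:
        for i, v, p in zip(ids, vectors, payloads):
            self._rows[i] = (v, p)

    def query(self, vector, k):
        return list(self._rows)[:k]


class WrongSignatureStore:
    # Right method names, wrong parameters.
    def upsert(self): ...
    def query(self): ...


class MissingQueryStore:
    def upsert(self, ids, vectors, payloads): ...


assert isinstance(GoodStore(), Store)
assert isinstance(WrongSignatureStore(), Store)    # passes: only names are checked
assert not isinstance(MissingQueryStore(), Store)  # fails: a method is absent
```

The static type checker catches the signature mismatch that isinstance misses, and the round-trip test catches behavioral errors that neither can see.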

Deterministic seeds

Any source of randomness — sampling, ordering, mock embeddings, fan-out scheduling — is seeded so tests are reproducible run to run. The pipeline accepts a seed (DirectoryPipeline(seed=0)), and tests pass a fixed value. This is what makes the golden-file tests above stable: nondeterministic output would make byte-equality impossible. Never trade reproducibility for speed in a test.
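A common way to seed mock embeddings or ordering in a test is to use a private `random.Random(seed)` instance rather than the module-level generator. The helpers below are hypothetical and only illustrate the idea; they are not part of indx.

```python
import random


def fake_vectors(n: int, dim: int, seed: int) -> list[list[float]]:
    """Generate reproducible mock embeddings from a fixed seed."""
    rng = random.Random(seed)  # private generator; global random state untouched
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


def shuffled(items: list[str], seed: int) -> list[str]:
    """Return a reproducibly shuffled copy, leaving the input list unchanged."""
    rng = random.Random(seed)
    out = list(items)
    rng.shuffle(out)
    return out


# Same seed, same output, run after run.
assert fake_vectors(2, 3, seed=0) == fake_vectors(2, 3, seed=0)
assert shuffled(["a", "b", "c"], seed=0) == shuffled(["a", "b", "c"], seed=0)
```

A private generator keeps one test's randomness from depending on how many random draws earlier tests made, which the shared module-level state cannot guarantee.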

Coverage targets

Coverage is reported in CI and a drop fails the check. New code must not lower coverage.

Area	Minimum coverage
Core (`core/`) and stages (`pipeline/stages/`)	≥ 90%
Adapters (parsers, llm, vlm, embed, store, output)	≥ 80%

The higher bar on core/ and the stages reflects their role: they sit at the bottom of the dependency graph, so every other package depends on them, and they are where a determinism or data-model regression does the most damage. Adapters are thinner (convert at the edge, lazy-import the heavy dependency) and lean on the shared conformance fixture, so the bar is slightly lower but still enforced.

What CI runs

PR checks run the offline unit suite across Python 3.11 / 3.12 / 3.13 (Linux primary; macOS/Windows smoke), plus ruff lint + format-check, mypy and pyright strict, and a clean light-core build. No network, no API keys.
Integration runs separately (integration.yml) against real services (Qdrant, Ollama, …) behind the @pytest.mark.integration gate.
The --live backend tests are a manually triggered / nightly job, never a blocker for merge.

Before opening a PR: pytest (offline) must pass with new code covered, and if you changed the serialized shape you must regenerate goldens deliberately and bump schema_version. The full checklist lives in the coding standards.
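One way to keep golden regeneration deliberate is to route every comparison through a helper that rewrites the file only when an explicit switch is set. This is a sketch under assumptions: the helper name and the `INDX_UPDATE_GOLDENS` environment variable are hypothetical, not an existing indx feature.

```python
import os
from pathlib import Path


def assert_matches_golden(produced: bytes, golden_path: Path) -> None:
    """Compare bytes against a committed golden; rewrite only when asked to."""
    # INDX_UPDATE_GOLDENS is a hypothetical opt-in switch for this sketch.
    if os.environ.get("INDX_UPDATE_GOLDENS") == "1":
        golden_path.parent.mkdir(parents=True, exist_ok=True)
        golden_path.write_bytes(produced)
        return
    expected = golden_path.read_bytes()
    assert produced == expected, (
        f"{golden_path} differs from the pipeline output; "
        "regenerate deliberately if the change is intended"
    )
```

With a helper like this, a regeneration run would set the variable once, and the resulting golden diff would then be reviewed in the PR together with the schema_version bump.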