
Design Principles

Every decision in indx, from the data models to the packaging matrix, serves a small set of values. These are the seven guiding principles that the rest of the codebase exists to protect. When two candidate designs conflict, the one that better upholds these principles wins.

If you are exploring the system top-down, start with the architecture overview; if you are about to contribute code, these principles are codified as enforceable rules in the coding standards.

Behaviour is defined as a typed Protocol before any implementation exists. Code in core/ depends on interfaces — Parser, LLM, VLM, Embedder, Store, OutputWriter, and the Stage protocol — never on a concrete backend. Because indx uses structural typing (typing.Protocol, not abstract base classes), a third-party object that matches a signature “drops straight in” without importing or subclassing anything from indx.

Why it matters: this is the technical backbone of “no lock-in” (PRD goal G4). A new vector store or LLM can be added without touching core/, and the same protocol surface is what the public API and the docs are built on. See the full interface contracts in the protocols reference and the extension recipe in Bring Your Own Stack.
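As a rough illustration of the structural typing described above, the sketch below defines a stand-in Embedder protocol and a third-party class that satisfies it without importing or subclassing anything. The method name embed, the attributes name and dim, and the helper index_texts are assumptions made for this example; the real contracts are the ones in the protocols reference.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Embedder(Protocol):
    """Hypothetical shape of the embedder slot (not the library's actual signature)."""

    name: str
    dim: int

    def embed(self, texts: list[str]) -> list[list[float]]: ...


class HashEmbedder:
    """A 'third-party' class: it neither imports nor subclasses the protocol."""

    name = "toy-hash"
    dim = 4

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            # Deterministic toy vector derived from the character codes.
            total = sum(ord(ch) for ch in text)
            vectors.append([float((total + i) % 7) for i in range(self.dim)])
        return vectors


def index_texts(embedder: Embedder, texts: list[str]) -> int:
    """Core-style code: depends only on the protocol, never on HashEmbedder."""
    vectors = embedder.embed(texts)
    if any(len(vector) != embedder.dim for vector in vectors):
        raise ValueError("embedder returned a vector of the wrong dimension")
    return len(vectors)
```

A static type checker accepts index_texts(HashEmbedder(), ...) because the shapes match; the runtime_checkable decorator is only there so that isinstance can confirm the same thing at run time.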

pip install indx must install fast and run air-gapped — with no network and no GPU. The core carries only Typer, Rich, Click, Pydantic v2, and pydantic-settings (TOML parsing is stdlib tomllib). Every heavy or vendor-specific dependency — Docling, Torch, a Qdrant client, a cloud SDK — lives behind an optional extra and is imported lazily inside the method that needs it. A missing dependency never crashes at import time; it raises a MissingDependencyError with the exact pip install indx[...] hint.

Why it matters: this is how the “light core” constraint is enforced at the packaging layer, and it is what makes the default stack genuinely offline-capable (PRD goal G3). The zero-dependency fallbacks — the plaintext parser, the jsonl store, the none VLM, and the .indx + jsonl writers — all ship in core, so a complete run works with no extras installed at all. See the air-gapped guide, the extras reference, and dependency layering.

# Lazy import inside the method, with an actionable error
def connect(self) -> None:
    try:
        from qdrant_client import QdrantClient
    except ModuleNotFoundError as exc:
        raise MissingDependencyError(
            "The Qdrant store requires the 'qdrant' extra. "
            "Install it with: pip install indx[qdrant]"
        ) from exc
    self._client = QdrantClient(url=self.url)

indx never swallows an error or returns a half-built result silently. Everything it raises descends from a single base, IndxError, so callers can catch the library cleanly — with typed subtypes like ConfigError, MissingDependencyError, StageError, ParseError, EmbedError, and StoreError. Every message states what failed, where (which file and stage), and what to do about it.

Why it matters: ingestion runs touch thousands of untrusted files, and a vague ValueError("bad input") three stages downstream is worthless. Actionable, typed errors are a non-functional requirement (NFR-OBS-1) and a direct payoff for regulated, auditable estates. See the full hierarchy and the matching shell statuses in errors and exit codes.
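A minimal sketch of what "what failed, where, and what to do about it" could look like in code. The class names IndxError and ParseError come from the text above, but the constructor arguments (what, where, fix) and the helper parse_bytes are hypothetical; the actual hierarchy is documented in errors and exit codes.

```python
class IndxError(Exception):
    """Single base class, so callers can catch everything the library raises."""

    def __init__(self, what: str, *, where: str, fix: str) -> None:
        self.what = what
        self.where = where
        self.fix = fix
        super().__init__(f"{what} (at {where}). {fix}")


class ParseError(IndxError):
    """Raised when an input file cannot be parsed."""


def parse_bytes(path: str, raw: bytes) -> str:
    """Hypothetical parse step: decode a file, or fail loudly with context."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Wrap the low-level error instead of swallowing it or returning partial text.
        raise ParseError(
            "Could not decode the file as UTF-8",
            where=f"file={path}, stage=parse",
            fix="Re-save the file as UTF-8 or select a different parser.",
        ) from exc
```

Chaining with "from exc" keeps the original traceback available while the caller still only needs to catch IndxError.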

Same inputs + same config + same versions ⇒ a byte-stable index.json. Randomness is seeded, model identifiers are pinned, and provenance — input set, component versions, the embedder name and dim, and the resolved config — is recorded into the archive manifest. Serialization is fixed and diff-friendly: stable field ordering, sorted keys for free-form maps, UTC ISO-8601 timestamps, fixed float representation, and a top-level schema_version.

Why it matters: the product’s core job is producing trustworthy serialized artifacts (PRD goal G5, NFR-DET-1). Golden-file tests assert byte-equality of index.json against a committed fixture, which is the determinism guardrail. Because the embedder’s name and dim live in the manifest, a consumer can detect “these vectors came from a different model” before querying. Inherently non-deterministic LLM enrichment is flagged and seeded where the model permits. See the reproducibility guide, the index.json reference, and the .indx archive format.
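The serialization rules above can be sketched with the standard library alone. This is an illustration under simplifying assumptions, not the library's writer: it sorts every key (the text only requires that for free-form maps), relies on Python's default float formatting, and uses the hypothetical helper names serialize_index and utc_timestamp.

```python
import json
from datetime import datetime, timedelta, timezone


def serialize_index(index: dict) -> bytes:
    """Serialize with sorted keys, fixed separators, and one trailing newline."""
    text = json.dumps(
        index,
        sort_keys=True,          # insertion order can no longer affect the bytes
        ensure_ascii=False,
        indent=2,
        separators=(",", ": "),
    )
    return (text + "\n").encode("utf-8")


def utc_timestamp(moment: datetime) -> str:
    """Render a timezone-aware datetime as a UTC ISO-8601 string."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Two logically equal documents built in a different key order.
first = {"schema_version": 1, "embedder": {"name": "toy-hash", "dim": 4}}
second = {"embedder": {"dim": 4, "name": "toy-hash"}, "schema_version": 1}
same_bytes = serialize_index(first) == serialize_index(second)
```

A golden-file test in this style would compare serialize_index(...) against the committed fixture byte for byte.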

Configuration is an explicit, validated, typed Pydantic object. There is no hidden environment-driven behaviour, no implicit globals, and no “it sometimes does X”. indx.toml is sectioned by slot, and each section deserializes into a discriminated-union model keyed on the chosen backend, so an unknown backend or a missing required option fails fast with a precise error at load time. Precedence is explicit: CLI flag > indx.toml > built-in default. Secrets come from environment variables, never the committed file, and are never logged or serialized.

Why it matters: what the config says is exactly what happens — and what gets recorded into the manifest for reproducibility (principle 4). That predictability is what lets you audit and re-create a knowledge space. See the configuration guide and the full configuration reference.
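To make the precedence rule and the fail-fast behaviour concrete, here is a standard-library sketch. It deliberately swaps Pydantic's discriminated unions for plain dataclasses and a lookup table so it runs with no extras; the option names (path, url), the model classes, and the helpers resolve and load_store are all hypothetical.

```python
from collections import ChainMap
from dataclasses import dataclass


class ConfigError(Exception):
    """Stand-in for the library's ConfigError."""


@dataclass(frozen=True)
class JsonlStore:
    backend: str
    path: str = "index.jsonl"


@dataclass(frozen=True)
class QdrantStore:
    backend: str
    url: str  # required: there is no default


# The 'backend' value plays the role of the discriminator.
STORE_MODELS = {"jsonl": JsonlStore, "qdrant": QdrantStore}


def resolve(cli_flags: dict, toml_section: dict, defaults: dict) -> dict:
    """CLI flag > indx.toml > built-in default (ChainMap searches left to right)."""
    return dict(ChainMap(cli_flags, toml_section, defaults))


def load_store(raw: dict):
    """Pick the model from 'backend' and fail fast on anything unexpected."""
    backend = raw.get("backend")
    model = STORE_MODELS.get(backend)
    if model is None:
        raise ConfigError(
            f"[store] unknown backend {backend!r}; expected one of {sorted(STORE_MODELS)}"
        )
    try:
        return model(**raw)
    except TypeError as exc:  # missing required option or unknown option
        raise ConfigError(f"[store] invalid options for backend {backend!r}: {exc}") from exc
```

For example, resolve({"url": "http://b:6333"}, {"backend": "qdrant", "url": "http://a:6333"}, {"backend": "jsonl"}) keeps the backend from the file but takes the url from the CLI flag.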

The SDK is the product; the CLI is a thin Typer view over it. Any capability in the CLI exists in the SDK, and vice versa — they never drift. Every command maps to a public SDK call: indx ./docs --out ./ai-ready is exactly DirectoryPipeline().run("./docs", "./ai-ready"). The CLI parses arguments, calls the SDK, and renders the result with Rich; it contains no business logic the SDK lacks. CLI option names mirror config and SDK parameter names, a parity test asserts the mapping, and new features land in the SDK first with the CLI command added in the same change.

Why it matters: notebook, automation, and shell users share one mental model (PRD goal G6), so moving from a prototype to an automated pipeline never means relearning the tool. See the SDK reference and the CLI reference.

# cli.py stays thin — the SDK does the work, Rich does the view
@app.command()
def query(space: Path, text: str, k: int = 5) -> None:
    hits = KnowledgeSpace.load(space).search(text, k=k)
    render_hits(hits)
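One way a parity test could assert the mapping is to compare function signatures. The sketch below is an assumption about how such a check might look, not the project's actual test: sdk_search and cli_query are toy stand-ins, and parity_gaps is a hypothetical helper.

```python
import inspect


def sdk_search(text: str, k: int = 5) -> list:
    """Toy stand-in for an SDK call."""
    return []


def cli_query(space: str, text: str, k: int = 5) -> None:
    """Toy stand-in for the matching CLI command."""


def parity_gaps(cli_fn, sdk_fn, cli_only=()) -> list[str]:
    """List every place where CLI options and SDK parameters disagree."""
    cli = inspect.signature(cli_fn).parameters
    sdk = inspect.signature(sdk_fn).parameters
    gaps = []
    for name, param in sdk.items():
        if name not in cli:
            gaps.append(f"SDK parameter {name!r} has no CLI option")
        elif cli[name].default != param.default:
            gaps.append(f"default for {name!r} differs between CLI and SDK")
    for name in cli:
        # cli_only names options that locate the target rather than configure the call.
        if name not in sdk and name not in cli_only:
            gaps.append(f"CLI option {name!r} has no SDK parameter")
    return gaps
```

Here parity_gaps(cli_query, sdk_search, cli_only=("space",)) returns an empty list; renaming k on either side, or changing its default, would produce an entry.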

Output, errors, and logs are written to be read by both a human reviewer and a model. indx prefers explicit, structured, self-describing data: a .indx archive is a plain Zip with a readable manifest.json and index.json, inspectable with ubiquitous tooling (unzip) and free of Python-only constructs like pickled blobs. No vendor types leak into core models — a Document never stores a raw provider response — so the artifact stays neutral and portable.

Why it matters: legibility is what makes the knowledge space shippable. A teammate, an auditor, or a downstream agent can open the artifact and understand it without indx in the loop, which underpins portability (G5) and the no-lock-in promise (G4). The data models reference and .indx archive reference document exactly what that legible output looks like.
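Because the archive is described as a plain Zip holding manifest.json and index.json, it can be written and read with nothing but the standard library. The sketch below assumes a made-up manifest layout and hypothetical helper names (write_archive, read_manifest); the real field set is in the .indx archive reference.

```python
import io
import json
import zipfile


def write_archive(manifest: dict, index: dict) -> bytes:
    """Build an in-memory Zip with two readable JSON members and no pickles."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("manifest.json", json.dumps(manifest, sort_keys=True, indent=2))
        archive.writestr("index.json", json.dumps(index, sort_keys=True, indent=2))
    return buffer.getvalue()


def read_manifest(data: bytes) -> dict:
    """Open the archive the way any Zip-aware tool would and parse the manifest."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return json.loads(archive.read("manifest.json"))


# Hypothetical contents, for illustration only.
blob = write_archive(
    {"embedder": {"name": "toy-hash", "dim": 4}},
    {"schema_version": 1, "documents": []},
)
manifest = read_manifest(blob)
```

A consumer can inspect manifest["embedder"] before querying to confirm the vectors came from the model it expects.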

| # | Principle | One-line meaning |
| --- | --- | --- |
| 1 | Protocol-first | Depend on typed interfaces, never on concrete backends. |
| 2 | Dependency-light core | Install fast, run air-gapped; heavy deps are lazy optional extras. |
| 3 | Fail loud, with context | Typed IndxError with what failed, where, and the fix. |
| 4 | Determinism | Same inputs + config + versions ⇒ byte-stable index.json. |
| 5 | Config is a contract | Validated, typed config; no hidden env behaviour. |
| 6 | SDK = CLI with handles | Full parity; the CLI is a thin view of the SDK. |
| 7 | Legible output | Self-describing data, readable by a person and an LLM. |