
Coding Standards

This page is the enforceable contract for contributing code to indx — written to be followed by a human contributor and an AI coding agent alike. If a rule here conflicts with what you want to do, the rule wins: open an issue to change the rule first.

It condenses the full standards into a navigable reference. For the values behind the rules see design principles; for the interfaces you implement see the protocol reference; for how all of this is verified see testing.

  • Full type hints are required on every function signature, method, and module-level attribute. Untyped public API is rejected in review.
  • Put from __future__ import annotations at the top of every module.
  • Interfaces are typing.Protocol (structural typing), not ABCs. Any object that fits the shape drops straight in without inheriting from indx.
  • mypy --strict and pyright (strict) must pass with zero errors in CI. No new # type: ignore without an inline reason comment.
  • No bare Any. Narrow it, or use object and validate. Any remaining Any carries a # Any: <reason> comment justifying it.
  • Prefer precise types: Path over str for paths; Sequence/Mapping for read-only inputs; Iterator/Iterable for streams; Literal[...] for fixed option sets; TypedDict or Pydantic models over loose dicts.
from __future__ import annotations
from pathlib import Path
# ❌ def parse(self, f): ...
# ✅
def parse(self, file: Path) -> ParsedDoc: ...
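
The structural-typing rule can be sketched as follows. This is a minimal illustration, not indx's real definitions: the `Parser` protocol shape, the `ParsedDoc` stand-in and its fields, `PlainTextParser`, and `parse_all` are all assumptions made for the example.

```python
from __future__ import annotations

from collections.abc import Sequence
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class ParsedDoc:
    """Stand-in for the real ParsedDoc model; these fields are hypothetical."""

    path: Path
    text: str


@runtime_checkable
class Parser(Protocol):
    """Assumed shape of the Parser protocol: one fully typed method."""

    def parse(self, file: Path) -> ParsedDoc: ...


class PlainTextParser:
    """Hypothetical adapter. It inherits nothing; it only fits the shape."""

    def parse(self, file: Path) -> ParsedDoc:
        return ParsedDoc(path=file, text=file.read_text(encoding="utf-8"))


def parse_all(parser: Parser, files: Sequence[Path]) -> list[ParsedDoc]:
    """Accept any object matching Parser and a read-only sequence of paths."""
    return [parser.parse(file) for file in files]
```

Because `Parser` is a protocol, a type checker accepts `PlainTextParser` wherever a `Parser` is expected without any import from indx. `runtime_checkable` is optional and only enables the `isinstance` check; it verifies that the method exists, not that its signature matches.
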
| Thing | Convention | Example |
| --- | --- | --- |
| Modules / packages | short, lowercase, avoid underscores | parsers, store, embed |
| Classes | PascalCase | DirectoryPipeline, DoclingParser, KnowledgeSpace |
| Functions / methods / vars | snake_case | build_graph, chunk_count |
| Constants | UPPER_SNAKE | SCHEMA_VERSION, DEFAULT_CHUNK_SIZE |
| Protocols | noun, no I/Base prefix | Parser, Store, OutputWriter |
| Adapter classes | <Vendor><Slot> | DoclingParser, BGEEmbedder, Qdrant |
| Type vars | short, suffix T | ChunkT |

Registry keys are the public, stable identifiers used in config and on the CLI to resolve a name to an implementation. They are:

  • lowercase, hyphen/colon-delimited, and vendor- or model-named — never the Python class name (engine = "docling", not "DoclingParser");
  • versioned by value where it matters: model = "bge-m3", llm = "ollama:qwen2.5";
  • stable — renaming a registry key is a breaking change.
[parser]
engine = "docling" # registry key, not "DoclingParser"
[embed]
model = "bge-m3"
[store]
backend = "qdrant" # → resolves to indx.store.Qdrant via entry point
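
One way to picture how a registry key maps to an implementation is the in-memory sketch below. It is an assumption-laden illustration: the `Registry` class, the key grammar in `_KEY_RE`, and the stand-in `ConfigError` are invented for this example, and it does not show the entry-point mechanism mentioned in the config comment above.

```python
from __future__ import annotations

import re
from collections.abc import Callable

# Assumed key grammar: lowercase segments joined by "-" or ":"; dots are
# allowed inside a segment so that "ollama:qwen2.5" is valid.
_KEY_RE = re.compile(r"[a-z0-9][a-z0-9.]*(?:[-:][a-z0-9][a-z0-9.]*)*")


class ConfigError(Exception):
    """Stand-in for indx's ConfigError, which descends from IndxError."""


class Registry:
    """Hypothetical in-memory registry for one slot, for example "store"."""

    def __init__(self, slot: str) -> None:
        self._slot = slot
        self._factories: dict[str, Callable[[], object]] = {}

    def register(self, key: str, factory: Callable[[], object]) -> None:
        if _KEY_RE.fullmatch(key) is None:
            raise ConfigError(
                f"Invalid {self._slot} key {key!r}: use lowercase with '-' or ':' "
                "(for example 'bge-m3'), not a Python class name."
            )
        if key in self._factories:
            raise ConfigError(f"{self._slot} key {key!r} is already registered.")
        self._factories[key] = factory

    def resolve(self, key: str) -> object:
        factory = self._factories.get(key)
        if factory is None:
            known = ", ".join(sorted(self._factories)) or "none"
            raise ConfigError(
                f"Unknown {self._slot} key {key!r}. Registered keys: {known}."
            )
        return factory()
```

Rejecting class-style names at registration time keeps the public identifiers independent of the Python class layout, which is what makes a later class rename a non-breaking change.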

Config keys mirror the stage/slot vocabulary exactly (parser, enrich, embed, store, output), are snake_case, and are never abbreviated beyond the public names used in the docs.

The core domain types — KnowledgeSpace, Document, Chunk, Relation, SpaceContext, ParsedDoc — are Pydantic v2 models (see the data-models reference) and follow these rules:

  • Frozen value models. Result/value types (Chunk, Relation, Document, ParsedDoc) are frozen=True. SpaceContext is the mutable carrier that flows through stages and is the only exception.
  • Validate at the boundary. Use field validators and constraints (Field(..., ge=0), enums/Literal for Relation.type). Reject bad data on construction, not three stages later.
  • No business logic in models. Models hold and validate data. Parsing, chunking, embedding, and graph-building live in stages/components — never as methods on Chunk. A @computed_field for a trivial derived value is fine; an LLM call is not.
  • No vendor types in core. A Document never stores a qdrant_client.PointStruct or a raw provider response. Adapters convert to and from core types at their edge.
  • Stable, diff-friendly serialization for index.json: fixed field ordering (sorted keys for free-form maps), a top-level schema_version, UTC ISO-8601 timestamps, a fixed float representation for vectors, and no Python-only constructs (no pickled blobs) in human-facing JSON.
from __future__ import annotations
from typing import Literal
from pydantic import BaseModel, Field, ConfigDict
class Relation(BaseModel):
    model_config = ConfigDict(frozen=True)

    type: Literal["sibling", "parent", "references", "continues", "duplicate-of"]
    to: str = Field(..., min_length=1)
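
The serialization rule can be illustrated with the standard library alone. The sketch below is not the project's writer: `ChunkRecord`, `dump_index`, the `built_at` field name, and the values of `SCHEMA_VERSION` and `VECTOR_DIGITS` are all hypothetical, and a plain dataclass stands in for the Pydantic model.

```python
from __future__ import annotations

import json
from collections.abc import Mapping, Sequence
from dataclasses import dataclass
from datetime import datetime, timezone

SCHEMA_VERSION = 1  # hypothetical value
VECTOR_DIGITS = 6  # hypothetical fixed precision for vector components


class IndxError(Exception):
    """Stand-in for the library's base error."""


@dataclass(frozen=True)
class ChunkRecord:
    """Stand-in for a serialized chunk; the real Chunk model has its own fields."""

    id: str
    vector: Sequence[float]
    meta: Mapping[str, str]


def dump_index(chunks: Sequence[ChunkRecord], built_at: datetime) -> str:
    """Render a deterministic, diff-friendly JSON document."""
    if built_at.tzinfo is None:
        raise IndxError(
            "built_at has no timezone; pass an aware datetime so it can be stored as UTC."
        )
    payload = {
        "schema_version": SCHEMA_VERSION,
        "built_at": built_at.astimezone(timezone.utc).isoformat(timespec="seconds"),
        "chunks": [
            {
                "id": chunk.id,
                "vector": [round(x, VECTOR_DIGITS) for x in chunk.vector],
                "meta": dict(chunk.meta),
            }
            for chunk in chunks
        ],
    }
    # sort_keys orders every object, which is stricter than the rule requires
    # but keeps the sketch short; indent puts one field per line for diffs.
    return json.dumps(payload, sort_keys=True, indent=2, ensure_ascii=False) + "\n"
```

Two calls with the same logical content produce byte-identical output regardless of the insertion order of `meta` or the timezone of `built_at`, so a diff of index.json reflects only real changes.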

A Stage is a unit of the pipeline. The six ordered stages (Walk → Parse → Chunk → Relate → Enrich → Embed+Pack) all obey:

  • Signature run(ctx: SpaceContext) -> SpaceContext. A stage receives the shared context, does its work, and returns the same mutated context. Stages communicate only through SpaceContext — never via globals or side channels.
  • Idempotent / resume-aware. Re-running a stage on its own output must not corrupt state or duplicate work. Skip work already recorded in the context/cache (content-hash keyed).
  • No swallowed errors. A per-file failure either raises (fail-loud) or is recorded as a typed, visible error on the context with enough detail to act on. Never an empty except.
  • Emit progress through the shared reporting hook (Rich progress for the CLI, structured logs otherwise).
  • Replaceable and optional. A stage must not assume a specific neighbor implementation. Stages can be inserted (e.g. a redaction pass before Enrich) or dropped (e.g. Embed, if the store self-embeds).
# ✅
class ChunkStage:
    name = "chunk"

    def run(self, ctx: SpaceContext) -> SpaceContext:
        for doc in ctx.iter_unchunked():      # resume-aware
            ctx.report(self.name, doc.path)   # progress
            ctx.add_chunks(self._chunk(doc))
        return ctx                            # SAME context

# ❌ swallows errors, no progress, mutates a global
def run(self, ctx):
    try:
        GLOBAL_CHUNKS.extend(...)
    except Exception:
        pass
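
To make the idempotency and error-recording rules concrete, here is a self-contained sketch. The `SpaceContext` below is a hypothetical stand-in with invented fields (`docs`, `chunks`, `errors`), and the fixed-width chunking is a placeholder for real chunking logic; only the `run(ctx) -> ctx` signature comes from the rules above.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


@dataclass
class SpaceContext:
    """Hypothetical stand-in for the real context; the fields are invented."""

    docs: dict[str, str] = field(default_factory=dict)  # path -> text
    chunks: dict[str, list[str]] = field(default_factory=dict)  # content hash -> chunks
    errors: dict[tuple[str, str], str] = field(default_factory=dict)  # (stage, path) -> reason


class ChunkStage:
    name = "chunk"

    def __init__(self, size: int = 16) -> None:
        self._size = size

    def run(self, ctx: SpaceContext) -> SpaceContext:
        for path, text in ctx.docs.items():
            if not text.strip():
                # Record a visible failure instead of skipping silently.
                # Keying by (stage, path) keeps a re-run from duplicating it.
                ctx.errors[(self.name, path)] = "document is empty; nothing to chunk"
                continue
            key = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if key in ctx.chunks:
                continue  # already chunked: a re-run does no duplicate work
            ctx.chunks[key] = [
                text[i : i + self._size] for i in range(0, len(text), self._size)
            ]
        return ctx  # the same object that came in
```

Running the stage twice on the same context leaves `chunks` and `errors` unchanged after the first pass, which is the property a resume after a crash depends on.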

Everything indx raises descends from a single typed base, IndxError, so callers can catch the library cleanly. See errors and exit codes for the full hierarchy and the CLI mapping.

IndxError (base)
├─ ConfigError # invalid / contradictory configuration
├─ MissingDependencyError # optional extra not installed (carries the pip hint)
├─ StageError # a stage failed (carries stage name + offending path)
└─ ParseError / EmbedError / StoreError # component-level failures
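
The hierarchy above could be declared roughly as follows. Constructor signatures, attribute names, and the `indx[<extra>]` spelling of the pip hint are assumptions made for this sketch; only the class names and what each error carries come from the tree.

```python
from __future__ import annotations

from pathlib import Path


class IndxError(Exception):
    """Base class: `except IndxError` catches everything the library raises."""


class ConfigError(IndxError):
    """Invalid or contradictory configuration."""


class MissingDependencyError(IndxError):
    """An optional extra is not installed; carries the pip hint."""

    def __init__(self, package: str, extra: str) -> None:
        # The "indx[<extra>]" spelling is an assumption made for this sketch.
        self.pip_hint = f'pip install "indx[{extra}]"'
        super().__init__(f"{package} is not installed. Install it with: {self.pip_hint}")


class StageError(IndxError):
    """A stage failed; carries the stage name and the offending path."""

    def __init__(self, stage: str, path: Path, reason: str) -> None:
        self.stage = stage
        self.path = path
        super().__init__(f"Stage {stage!r} failed on {path}: {reason}")


class ParseError(IndxError):
    """Component-level parse failure; EmbedError and StoreError are analogous."""
```
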
  • Raise typed errors with actionable messages — state what failed, where (file/stage), and the fix. raise ParseError(f"Docling could not parse {path}: {reason}. Try --parser unstructured."), not raise ValueError("bad input").
  • Use logging, not print. Library code logs to a module logger (logging.getLogger(__name__)). User-facing CLI output uses Rich (progress, tables, panels). Never print() from library code.
  • Levels: DEBUG internals, INFO milestones, WARNING recoverable/degraded, ERROR failures. The CLI maps -v/-q flags to log levels.
  • Never log secrets. API keys, tokens, and connection strings are redacted in logs, errors, and serialized output. Log the config shape, not its secret values.
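
A minimal sketch of "log the config shape, not its secret values" follows. The marker list, the `***` placeholder, and the function names are hypothetical; a real project would keep its own list of sensitive field names.

```python
from __future__ import annotations

import logging
from collections.abc import Mapping

logger = logging.getLogger(__name__)  # module logger, never print()

# Hypothetical markers: any key containing one of these is treated as secret.
_SECRET_MARKERS = ("key", "token", "secret", "password", "connection")
REDACTED = "***"


def redact(config: Mapping[str, object]) -> dict[str, object]:
    """Return a copy of config with secret-looking values masked."""
    safe: dict[str, object] = {}
    for name, value in config.items():
        if isinstance(value, Mapping):
            safe[name] = redact(value)  # recurse into nested sections
        elif any(marker in name.lower() for marker in _SECRET_MARKERS):
            safe[name] = REDACTED
        else:
            safe[name] = value
    return safe


def log_config(config: Mapping[str, object]) -> None:
    """Log the shape of the config at DEBUG without its secret values."""
    logger.debug("resolved config: %s", redact(config))
```

Matching on substrings over-masks on purpose (a field named `registry_key` would also be hidden); for logs, a false positive costs little and a leaked token costs a lot.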

Any capability in the CLI must exist in the SDK, and vice versa. The CLI is a thin Typer wrapper that parses arguments, calls the SDK, and renders the result with Rich. It contains no logic the SDK lacks.

  • Every CLI command maps to a public SDK call: indx ./docs --out ./ai-ready → DirectoryPipeline().run("./docs", "./ai-ready").
  • A new feature lands in the SDK first; the CLI command is added in the same PR.
  • CLI option names mirror config/SDK parameter names. No CLI-only behavior, no SDK-only escape hatches the CLI can’t reach.
  • A parity test asserts each CLI command has a corresponding SDK entry point.
# cli.py — thin
@app.command()
def query(space: Path, text: str, k: int = 5) -> None:
    hits = KnowledgeSpace.load(space).search(text, k=k)  # SDK does the work
    render_hits(hits)                                    # Rich does the view
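
The parity test mentioned in the list above might look something like this. It is a stdlib-only sketch under stated assumptions: `sdk_search`, `cli_query`, the `COMMANDS` table, and `option_mismatches` are hypothetical, and plain functions stand in for Typer commands.

```python
from __future__ import annotations

import inspect
from collections.abc import Callable


def sdk_search(space: str, text: str, k: int = 5) -> list[str]:
    """Hypothetical SDK entry point."""
    return [f"{space}:{text}"][:k]


def cli_query(space: str, text: str, k: int = 5) -> None:
    """Hypothetical CLI command body: delegate to the SDK, nothing else."""
    sdk_search(space, text, k=k)


# Hypothetical parity table: every CLI command names the SDK call it wraps.
COMMANDS: dict[str, tuple[Callable[..., object], Callable[..., object]]] = {
    "query": (cli_query, sdk_search),
}


def option_mismatches(cli: Callable[..., object], sdk: Callable[..., object]) -> set[str]:
    """Return parameter names that appear on one side only."""
    cli_params = set(inspect.signature(cli).parameters)
    sdk_params = set(inspect.signature(sdk).parameters)
    return cli_params ^ sdk_params


def test_cli_sdk_parity() -> None:
    for name, (cli, sdk) in COMMANDS.items():
        assert option_mismatches(cli, sdk) == set(), name
```

Comparing parameter names catches the most common drift, such as a CLI option renamed without the SDK following. It does not prove that behavior matches, so it complements review instead of replacing it.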

See the CLI and SDK references for the matched surfaces.

  • Google-style docstrings, consistent project-wide.
  • Every public API element is documented — modules, public classes, protocols, functions, and config fields. A public symbol without a docstring fails review.
  • Docstrings describe behavior, args, returns, raised IndxError subtypes, and a short example for non-trivial API.
  • Comments explain why, not what. The code says what it does; the comment says why it had to.
def relate(self, ctx: SpaceContext) -> SpaceContext:
    """Resolve typed relations between documents.

    Args:
        ctx: The shared space context after chunking.

    Returns:
        The context with `Relation` edges added to the graph.

    Raises:
        StageError: If reference resolution encounters an unreadable document.
    """
    # Resolve siblings before references: references may point at a sibling,
    # and we want the canonical node to exist first.   <- WHY, not WHAT
    ...
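
The "every public symbol has a docstring" rule lends itself to an automated check. The helper below is a hypothetical sketch, not a tool the project is known to ship; it inspects a namespace such as `vars(some_module)` and covers only classes and functions, not config fields.

```python
from __future__ import annotations

import inspect
from collections.abc import Mapping


def undocumented_public(namespace: Mapping[str, object]) -> list[str]:
    """Return sorted names of public classes and functions lacking a docstring."""
    missing: list[str] = []
    for name, obj in namespace.items():
        if name.startswith("_"):
            continue  # private by convention
        if not (inspect.isclass(obj) or inspect.isfunction(obj)):
            continue  # constants and imported modules are out of scope here
        if not (getattr(obj, "__doc__", None) or "").strip():
            missing.append(name)
    return sorted(missing)
```

A test could assert that `undocumented_public(vars(module))` is empty for each public module, turning the review rule into a CI failure.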