Coding Standards
This page is the enforceable contract for contributing code to indx — written to be followed by a human contributor and an AI coding agent alike. If a rule here conflicts with what you want to do, the rule wins: open an issue to change the rule first.
It condenses the full standards into a navigable reference. For the values behind the rules see design principles; for the interfaces you implement see the protocol reference; for how all of this is verified see testing.
Typing rules
- Full type hints are required on every function signature, method, and module-level attribute. Untyped public API is rejected in review.
- Put `from __future__ import annotations` at the top of every module.
- Interfaces are `typing.Protocol` (structural typing), not ABCs. Any object that fits the shape drops straight in without inheriting from indx.
- `mypy --strict` and pyright (strict) must pass with zero errors in CI. No new `# type: ignore` without an inline reason comment.
- No bare `Any`. Narrow it, or use `object` and validate. Any remaining `Any` carries a `# Any: <reason>` comment justifying it.
- Prefer precise types: `Path` over `str` for paths; `Sequence`/`Mapping` for read-only inputs; `Iterator`/`Iterable` for streams; `Literal[...]` for fixed option sets; `TypedDict` or Pydantic models over loose dicts.

    from __future__ import annotations
    from pathlib import Path

    # ❌
    def parse(self, f): ...

    # ✅
    def parse(self, file: Path) -> ParsedDoc: ...

Naming conventions
| Thing | Convention | Example |
|---|---|---|
| Modules / packages | short, lowercase, avoid underscores | parsers, store, embed |
| Classes | PascalCase | DirectoryPipeline, DoclingParser, KnowledgeSpace |
| Functions / methods / vars | snake_case | build_graph, chunk_count |
| Constants | UPPER_SNAKE | SCHEMA_VERSION, DEFAULT_CHUNK_SIZE |
| Protocols | noun, no I/Base prefix | Parser, Store, OutputWriter |
| Adapter classes | <Vendor><Slot> | DoclingParser, BGEEmbedder, Qdrant |
| Type vars | short, suffix T | ChunkT |
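
The typing and naming rules above can be sketched together. This is a minimal illustration, not indx's actual API: the shape of `Parser` and `ParsedDoc`, the `PlainTextParser` adapter, the `Mode` alias, and `parse_all` are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Literal, Protocol, Sequence

DEFAULT_ENCODING = "utf-8"  # constant: UPPER_SNAKE

Mode = Literal["strict", "lenient"]  # fixed option set instead of a loose str


@dataclass(frozen=True)
class ParsedDoc:
    """Hypothetical stand-in for the real ParsedDoc model."""

    path: Path
    text: str


class Parser(Protocol):  # protocol: plain noun, no I/Base prefix
    def parse(self, file: Path) -> ParsedDoc: ...


class PlainTextParser:  # adapter named <Vendor><Slot>; does NOT inherit from Parser
    def __init__(self, mode: Mode = "strict") -> None:
        self.mode = mode

    def parse(self, file: Path) -> ParsedDoc:
        # "lenient" replaces undecodable bytes instead of raising.
        errors = "strict" if self.mode == "strict" else "replace"
        text = file.read_text(encoding=DEFAULT_ENCODING, errors=errors)
        return ParsedDoc(path=file, text=text)


def parse_all(parser: Parser, files: Sequence[Path]) -> list[ParsedDoc]:
    """Accept any object that fits the Parser shape (structural typing)."""
    return [parser.parse(f) for f in files]
```

Because `Parser` is a protocol, `PlainTextParser` satisfies it by shape alone, which is what lets third-party objects drop in without importing a base class.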
Registry-key rules
Registry keys are the public, stable identifiers used in config and on the CLI to resolve a name to an implementation. They are:
- lowercase, hyphen/colon-delimited, and vendor- or model-named, never the Python class name (`engine = "docling"`, not `"DoclingParser"`);
- versioned by value where it matters: `model = "bge-m3"`, `llm = "ollama:qwen2.5"`;
- stable: renaming a registry key is a breaking change.

    [parser]
    engine = "docling"   # registry key, not "DoclingParser"

    [embed]
    model = "bge-m3"

    [store]
    backend = "qdrant"   # → resolves to indx.store.Qdrant via entry point

Config keys mirror the stage/slot vocabulary exactly (`parser`, `enrich`, `embed`, `store`, `output`), are snake_case, and are never abbreviated beyond the public names used in the docs.
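
One way to picture key resolution is a small registry that rejects malformed keys at registration time. The page says real keys resolve via entry points; this sketch swaps in an in-memory dict to stay self-contained. `KEY_PATTERN`, `register`, `resolve`, and the local error classes are hypothetical names, and the exact key grammar (dots allowed for versions) is an assumption.

```python
from __future__ import annotations

import re
from typing import Callable, TypeVar

# Assumed key grammar: lowercase tokens joined by "-" or ":"; dots allowed for versions.
KEY_PATTERN = re.compile(r"[a-z0-9.]+(?:[-:][a-z0-9.]+)*")

ImplT = TypeVar("ImplT")


class IndxError(Exception):
    """Local stand-in for the library's base error."""


class ConfigError(IndxError):
    """Invalid or contradictory configuration."""


_REGISTRY: dict[str, type] = {}


def register(key: str) -> Callable[[type[ImplT]], type[ImplT]]:
    """Class decorator that files an implementation under a registry key."""
    if not KEY_PATTERN.fullmatch(key):
        raise ConfigError(
            f"Invalid registry key {key!r}: use lowercase, hyphen/colon-delimited names."
        )

    def decorator(cls: type[ImplT]) -> type[ImplT]:
        if key in _REGISTRY:
            raise ConfigError(f"Registry key {key!r} is already taken.")
        _REGISTRY[key] = cls
        return cls

    return decorator


def resolve(key: str) -> type:
    """Map a config/CLI key to its implementation class."""
    try:
        return _REGISTRY[key]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY)) or "<none>"
        raise ConfigError(f"Unknown registry key {key!r}. Known keys: {known}.") from None


@register("docling")
class DoclingParser:
    """Placeholder body; only the name is taken from the examples above."""
```

Keeping the key separate from the class name is what makes a class rename safe: `resolve("docling")` keeps working as long as the key itself is unchanged.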
Data-model rules (Pydantic v2)
The core domain types (`KnowledgeSpace`, `Document`, `Chunk`, `Relation`, `SpaceContext`, `ParsedDoc`) are Pydantic v2 models (see the data-models reference) and follow these rules:
- Frozen value models. Result/value types (`Chunk`, `Relation`, `Document`, `ParsedDoc`) are `frozen=True`. `SpaceContext` is the mutable carrier that flows through stages and is the only exception.
- Validate at the boundary. Use field validators and constraints (`Field(..., ge=0)`, enums/`Literal` for `Relation.type`). Reject bad data on construction, not three stages later.
- No business logic in models. Models hold and validate data. Parsing, chunking, embedding, and graph-building live in stages/components, never as methods on `Chunk`. A `@computed_field` for a trivial derived value is fine; an LLM call is not.
- No vendor types in core. A `Document` never stores a `qdrant_client.PointStruct` or a raw provider response. Adapters convert to and from core types at their edge.
- Stable, diff-friendly serialization for `index.json`: fixed field ordering (sorted keys for free-form maps), a top-level `schema_version`, UTC ISO-8601 timestamps, a fixed float representation for vectors, and no Python-only constructs (no pickled blobs) in human-facing JSON.

    from __future__ import annotations
    from typing import Literal
    from pydantic import BaseModel, Field, ConfigDict

    class Relation(BaseModel):
        model_config = ConfigDict(frozen=True)
        type: Literal["sibling", "parent", "references", "continues", "duplicate-of"]
        to: str = Field(..., min_length=1)

Pipeline stage rules
A Stage is a unit of the pipeline. The six ordered stages (Walk → Parse → Chunk → Relate → Enrich → Embed+Pack) all obey:
- Signature `run(ctx: SpaceContext) -> SpaceContext`. A stage receives the shared context, does its work, and returns the same mutated context. Stages communicate only through `SpaceContext`, never via globals or side channels.
- Idempotent / resume-aware. Re-running a stage on its own output must not corrupt state or duplicate work. Skip work already recorded in the context/cache (content-hash keyed).
- No swallowed errors. A per-file failure either raises (fail-loud) or is recorded as a typed, visible error on the context with enough detail to act on. Never an empty `except`.
- Emit progress through the shared reporting hook (Rich progress for the CLI, structured logs otherwise).
- Replaceable and optional. A stage must not assume a specific neighbor implementation. Stages can be inserted (e.g. a redaction pass before Enrich) or dropped (e.g. Embed, if the store self-embeds).

    # ✅
    class ChunkStage:
        name = "chunk"

        def run(self, ctx: SpaceContext) -> SpaceContext:
            for doc in ctx.iter_unchunked():       # resume-aware
                ctx.report(self.name, doc.path)    # progress
                ctx.add_chunks(self._chunk(doc))
            return ctx                             # SAME context

    # ❌ swallows errors, no progress, mutates a global
    def run(self, ctx):
        try:
            GLOBAL_CHUNKS.extend(...)
        except Exception:
            pass

Error handling & logging
Everything indx raises descends from a single typed base, `IndxError`, so callers can catch the library cleanly. See errors and exit codes for the full hierarchy and the CLI mapping.

    IndxError (base)
    ├─ ConfigError                            # invalid / contradictory configuration
    ├─ MissingDependencyError                 # optional extra not installed (carries the pip hint)
    ├─ StageError                             # a stage failed (carries stage name + offending path)
    └─ ParseError / EmbedError / StoreError   # component-level failures

- Raise typed errors with actionable messages: state what failed, where (file/stage), and the fix. `raise ParseError(f"Docling could not parse {path}: {reason}. Try --parser unstructured.")`, not `raise ValueError("bad input")`.
- Use `logging`, not `print`. Library code logs to a module logger (`logging.getLogger(__name__)`). User-facing CLI output uses Rich (progress, tables, panels). Never `print()` from library code.
- Levels: `DEBUG` internals, `INFO` milestones, `WARNING` recoverable/degraded, `ERROR` failures. The CLI maps `-v`/`-q` flags to log levels.
- Never log secrets. API keys, tokens, and connection strings are redacted in logs, errors, and serialized output. Log the config shape, not its secret values.
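
The rules above can be sketched in a few lines: a typed hierarchy, an actionable re-raise, a module logger, and redaction before logging. Only the error names come from this page; the constructor of `StageError`, the `redact` helper, the `SECRET_MARKERS` heuristic, `log_config`, and `read_or_raise` are assumptions for illustration.

```python
from __future__ import annotations

import logging
from pathlib import Path
from typing import Mapping

logger = logging.getLogger(__name__)  # module logger; library code never print()s


class IndxError(Exception):
    """Base of everything the library raises."""


class StageError(IndxError):
    """A stage failed; carries the stage name and the offending path."""

    def __init__(self, stage: str, path: Path, reason: str) -> None:
        super().__init__(f"Stage {stage!r} failed on {path}: {reason}")
        self.stage = stage
        self.path = path


class ParseError(IndxError):
    """Component-level parse failure."""


# Assumed heuristic: any config field whose name contains one of these is secret.
SECRET_MARKERS = ("key", "token", "password", "secret", "url")


def redact(config: Mapping[str, object]) -> dict[str, object]:
    """Return a copy that keeps the config shape but hides secret values."""
    return {
        name: "***" if any(m in name.lower() for m in SECRET_MARKERS) else value
        for name, value in config.items()
    }


def log_config(config: Mapping[str, object]) -> None:
    logger.info("Loaded config: %s", redact(config))  # INFO: a milestone


def read_or_raise(path: Path) -> str:
    try:
        return path.read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError) as exc:
        # Typed and actionable: what failed, where, and what to try next.
        raise ParseError(
            f"Could not read {path}: {exc}. Check the path and the file encoding."
        ) from exc
```

A caller can then wrap any library call in `except IndxError` and be sure nothing from the library escapes as a bare `ValueError` or `OSError`.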
CLI ⇄ SDK parity (hard rule)
Any capability in the CLI must exist in the SDK, and vice versa. The CLI is a thin Typer wrapper that parses arguments, calls the SDK, and renders the result with Rich. It contains no logic the SDK lacks.
- Every CLI command maps to a public SDK call. `indx ./docs --out ./ai-ready` ⇔ `DirectoryPipeline().run("./docs", "./ai-ready")`.
- A new feature lands in the SDK first; the CLI command is added in the same PR.
- CLI option names mirror config/SDK parameter names. No CLI-only behavior, no SDK-only escape hatches the CLI can't reach.
- A parity test asserts each CLI command has a corresponding SDK entry point.

    # cli.py (thin wrapper)
    @app.command()
    def query(space: Path, text: str, k: int = 5) -> None:
        hits = KnowledgeSpace.load(space).search(text, k=k)   # SDK does the work
        render_hits(hits)                                     # Rich does the view

See the CLI and SDK references for the matched surfaces.
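
A parity test of the kind described above might look roughly like this. It is a stdlib-only sketch: `DemoSdk`, the `CLI_COMMANDS` table, and both helper functions are hypothetical, and a real test would presumably introspect the Typer app rather than a hand-written mapping.

```python
from __future__ import annotations

from typing import Mapping


class DemoSdk:
    """Hypothetical stand-in for the public SDK surface."""

    def build(self, source: str, out: str) -> str:
        return f"built {source} -> {out}"

    def query(self, space: str, text: str, k: int = 5) -> list[str]:
        return [f"{text}#{i}" for i in range(k)]


# Hypothetical table: CLI command name -> name of the SDK method it wraps.
CLI_COMMANDS: Mapping[str, str] = {"build": "build", "query": "query"}


def missing_sdk_entry_points(commands: Mapping[str, str], sdk: type) -> list[str]:
    """CLI commands whose SDK counterpart is absent or not callable."""
    return sorted(
        cli_name
        for cli_name, sdk_name in commands.items()
        if not callable(getattr(sdk, sdk_name, None))
    )


def sdk_only_methods(commands: Mapping[str, str], sdk: type) -> list[str]:
    """Public SDK methods that no CLI command reaches."""
    public = {
        name
        for name in vars(sdk)
        if not name.startswith("_") and callable(getattr(sdk, name))
    }
    return sorted(public - set(commands.values()))
```

Checking both directions matters because the rule is symmetric: an empty result from each function means there is no CLI-only command and no SDK-only escape hatch.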
Docstrings & comments
- Google-style docstrings, consistent project-wide.
- Every public API element is documented: modules, public classes, protocols, functions, and config fields. A public symbol without a docstring fails review.
- Docstrings describe behavior, args, returns, raised `IndxError` subtypes, and a short example for non-trivial API.
- Comments explain why, not what. The code says what it does; the comment says why it had to.

    def relate(self, ctx: SpaceContext) -> SpaceContext:
        """Resolve typed relations between documents.

        Args:
            ctx: The shared space context after chunking.

        Returns:
            The context with `Relation` edges added to the graph.

        Raises:
            StageError: If reference resolution encounters an unreadable document.
        """
        # Resolve siblings before references: references may point at a sibling,
        # and we want the canonical node to exist first.   <- WHY, not WHAT
        ...