Repository Architecture
This page explains how the indx repository is organized and the principles that keep it composable and cheap to install. The layout exists to make extension cheap (drop in a new parser or store) and installation cheap (the core stays tiny; heavy or cloud dependencies are opt-in extras).
Layout principles
indx follows six rules that shape every directory and import in the tree.
- src-layout. All importable code lives under src/indx/. This prevents the repo root from accidentally shadowing the installed package during tests, forces the test suite to run against the installed distribution, and keeps build backends, linters, and coverage unambiguous.
- Core stays dependency-light. indx.core contains only the domain model (KnowledgeSpace, Document, Chunk, Relation, SpaceContext, ParsedDoc), built on Pydantic v2 and the Python standard library. A bare pip install indx must work in an air-gapped, local-first environment with no cloud SDKs, no vector-DB clients, and no heavy parsing toolchains. Everything expensive is an optional extra.
- Protocol-in-core / impl-in-subpackage. Every swappable slot defines its contract as a typing.Protocol in that sub-package’s base.py. Concrete implementations live in sibling modules and import only the protocol, never each other. This keeps the dependency graph a DAG pointing inward toward core/ and the protocols.
- Optional extras for heavy/optional dependencies. Parsers, vector DBs, and cloud LLM/embedding SDKs are declared as pyproject.toml optional dependency groups. Importing an implementation whose extra is not installed raises a single, actionable error pointing at the right pip install indx[...].
- Plugin discovery via entry points. First-party implementations are wired through the internal registry/. Third-party packages register additional parsers, stores, or embedders by advertising Python entry points under groups such as indx.parsers, indx.stores, and indx.embedders. The registry discovers them lazily at runtime, so installing a plugin package is enough to make parser = "my-parser" work in indx.toml, with no edits to indx itself.
- Cloud defaults, local profile. Out of the box: parser=docling, llm=openai:gpt-5-mini, vlm=none, embedder=openai:text-embedding-3-small, store=qdrant, output=.indx. The local profile swaps in ollama:qwen2.5 and bge-m3 without changing the slot model. Names resolve through the registry; swapping a slot is a one-line config change.
Top-level project files
The repository root holds packaging and project metadata; nothing importable lives there.
| File / dir | Purpose |
|---|---|
| pyproject.toml | Build metadata, deps, optional extras, entry-point groups, and tool config |
| README.md | Project overview, quickstart, install matrix |
| LICENSE | Apache-2.0 |
| NOTICE | Apache-2.0 attribution notices |
| CHANGELOG.md | Human-readable release notes |
| CONTRIBUTING.md | Dev setup, how to add a parser/store, plugin guide |
| .pre-commit-config.yaml | ruff + mypy + formatting hooks |
| .github/workflows/ | CI: ci.yml (lint, type-check, unit tests across a Python-version matrix), integration.yml (opt-in tests against real Qdrant/Ollama services), release.yml (build + publish to PyPI on tag) |
| tests/ | Mirrors src/indx/ one-to-one (see Testing) |
| docs/, examples/ | Project documentation and runnable examples (sample dirs, indx.toml configs, a plugin template) |
The src/indx/ tree
Every slot sub-package follows the protocol-in-base.py, implementations-alongside convention. Slot defaults and the zero-dependency fallbacks are called out inline.
```
src/
└── indx/
    ├── __init__.py            # Public SDK surface: re-exports DirectoryPipeline, core types, version
    ├── py.typed               # PEP 561 marker; ships type information
    ├── _version.py            # Single source of version truth
    │
    ├── core/                  # Domain model: depends on NOTHING internal. Pydantic v2 + stdlib only.
    │   ├── errors.py          # Shared exception hierarchy (IndxError, MissingDependencyError, ...)
    │   ├── knowledge_space.py # KnowledgeSpace: root aggregate over Documents/Chunks/Relations
    │   ├── document.py        # Document: a source file + metadata + provenance
    │   ├── chunk.py           # Chunk: a retrievable unit, its text, span, and embedding slot
    │   ├── relation.py        # Relation: typed edge between Documents/Chunks (references, parent, ...)
    │   ├── context.py         # SpaceContext: shared run-state threaded through pipeline stages
    │   └── parsed.py          # ParsedDoc: normalized parser output (blocks, tables, images, text)
    │
    ├── pipeline/              # Orchestration of the 6 stages over a SpaceContext.
    │   ├── pipeline.py        # DirectoryPipeline: builds stages from config and runs them in order
    │   ├── stage.py           # Stage Protocol: run(ctx: SpaceContext) -> SpaceContext
    │   └── stages/            # One file per stage; each is a replaceable Stage.
    │       ├── walk.py        # 01 Walk: discover files -> Documents (respects ignores)
    │       ├── parse.py       # 02 Parse: run the configured parser -> ParsedDoc per Document
    │       ├── chunk.py       # 03 Chunk: split ParsedDoc into Chunks (size/overlap/structure-aware)
    │       ├── relate.py      # 04 Relate: derive Relations between Documents/Chunks
    │       ├── enrich.py      # 05 Enrich: LLM/VLM metadata, summaries, captions, keywords
    │       └── pack.py        # 06 Embed+Pack: embed Chunks, write via store + output writer
    │
    ├── parsers/               # Parser implementations. Slot default = docling.
    │   ├── base.py            # Parser Protocol: parse(Document) -> ParsedDoc
    │   ├── docling.py         # DoclingParser  [extra: docling]
    │   ├── unstructured.py    # UnstructuredParser  [extra: unstructured]
    │   ├── llamaparse.py      # LlamaParseParser  [extra: llamaparse]
    │   ├── markitdown.py      # MarkItDownParser  [extra: markitdown]
    │   └── plaintext.py       # PlainTextParser: zero-dep fallback, ships in core
    │
    ├── llm/                   # Text LLM clients. Slot default = openai:gpt-5-mini.
    │   ├── base.py            # LLM Protocol: complete(prompt, **opts) / chat(messages)
    │   ├── ollama.py          # OllamaLLM: local-profile default  [extra: ollama]
    │   ├── vllm.py            # VLLMClient: local/served  [extra: vllm]
    │   ├── openai.py          # OpenAILLM  [extra: openai]
    │   ├── anthropic.py       # AnthropicLLM  [extra: anthropic]
    │   └── azure.py           # AzureOpenAILLM  [extra: azure]
    │
    ├── vlm/                   # Vision-language clients for figures/images. Default = none.
    │   ├── base.py            # VLM Protocol: describe(image, prompt) -> str
    │   ├── none.py            # NoOpVLM: default; skips vision enrichment (ships in core)
    │   ├── qwen_vl.py         # QwenVLClient  [extra: qwen-vl]
    │   ├── gpt4o.py           # GPT4oVLM  [extra: openai]
    │   └── local.py           # LocalVLM: local served endpoint  [extra: vlm-local]
    │
    ├── embed/                 # Embedders. Slot default = openai:text-embedding-3-small.
    │   │                      # Note: package/config key is `embed`; protocol/entry-point group is `Embedder`/`indx.embedders`.
    │   ├── base.py            # Embedder Protocol: embed(texts) -> list[vector]; dim property
    │   ├── bge_m3.py          # BGEM3Embedder: local-profile default  [extra: bge]
    │   ├── e5.py              # E5Embedder  [extra: e5]
    │   ├── openai.py          # OpenAIEmbedder  [extra: openai]
    │   └── cohere.py          # CohereEmbedder  [extra: cohere]
    │
    ├── store/                 # Vector stores. Slot default = qdrant.
    │   ├── base.py            # Store Protocol: upsert(chunks) / search(vector, k) / delete
    │   ├── qdrant.py          # QdrantStore: default  [extra: qdrant]
    │   ├── pgvector.py        # PgVectorStore  [extra: pgvector]
    │   ├── chroma.py          # ChromaStore  [extra: chroma]
    │   ├── lancedb.py         # LanceDBStore  [extra: lancedb]
    │   └── jsonl.py           # JsonlStore: zero-dep fallback, ships in core
    │
    ├── output/                # Output writers (serialize a KnowledgeSpace for downstream use).
    │   ├── base.py            # OutputWriter Protocol: write(KnowledgeSpace, dest) -> None
    │   ├── indx_writer.py     # IndxWriter: default; delegates to archive/ (.indx)
    │   ├── jsonl_writer.py    # JsonlWriter: newline-delimited Documents/Chunks (ships in core)
    │   ├── langchain.py       # LangChainWriter: emit LC Documents  [extra: langchain]
    │   └── llamaindex.py      # LlamaIndexWriter: emit LI Nodes  [extra: llamaindex]
    │
    ├── archive/               # The .indx on-disk format (read + write).
    │   ├── format.py          # .indx container spec: manifest, version, layout constants
    │   ├── reader.py          # Read a .indx archive back into a KnowledgeSpace
    │   └── writer.py          # Serialize a KnowledgeSpace into a .indx archive
    │
    ├── config/                # indx.toml loading + validation.
    │   ├── schema.py          # Pydantic v2 models for the full indx.toml schema + defaults
    │   └── loader.py          # Find/merge indx.toml + CLI overrides + env -> validated Config
    │
    ├── registry/              # name -> class resolution and plugin loading.
    │   ├── __init__.py        # get_parser/get_llm/get_store/... resolver entry points
    │   ├── registry.py        # Registry: maps slot+name -> class; raises MissingDependencyError on gaps
    │   ├── builtins.py        # Registration table for first-party impls (lazy imports)
    │   └── plugins.py         # Discover third-party impls via importlib.metadata entry points
    │
    ├── cli/                   # Typer + Rich command-line interface.
    │   ├── app.py             # Typer app object; wires subcommands; --version callback
    │   ├── main.py            # `indx <dir> --out`: build + run a DirectoryPipeline
    │   ├── inspect.py         # `indx inspect`: summarize a .indx / knowledge space (Rich tables)
    │   ├── query.py           # `indx query`: embed a query, search the store, render hits
    │   └── _render.py         # Shared Rich rendering helpers (tables, progress, panels)
    │
    └── utils/                 # Cross-cutting helpers: no domain logic, no internal deps beyond errors.
        ├── logging.py         # Structured logging setup
        ├── hashing.py         # Stable content hashing for dedup/provenance
        ├── io.py              # Path/encoding helpers, safe file walking
        └── lazy.py            # require_extra() helper for friendly missing-dependency errors
```
Protocol vs. implementation
Every swappable slot pairs a Protocol defined in core (or that slot’s base.py) with one or more implementation modules that satisfy it structurally. At a high level:
| Slot | Protocol | Default implementation |
|---|---|---|
| Pipeline stage | Stage | the six modules under pipeline/stages/ |
| Parser | Parser | parsers/docling.py |
| LLM | LLM | llm/openai.py (openai:gpt-5-mini) |
| VLM | VLM | vlm/none.py |
| Embedder | Embedder | embed/openai.py (text-embedding-3-small) |
| Store | Store | store/qdrant.py |
| Output writer | OutputWriter | output/indx_writer.py |
All base.py modules define typing.Protocol classes that depend only on core. Implementations are registered by name in registry/builtins.py, so config values like parser = "docling" resolve directly to a class. For full signatures and the complete implementation list, see the protocols reference and registry and defaults.
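To make "satisfy it structurally" concrete, here is a minimal sketch of one slot. It assumes the parse(Document) -> ParsedDoc shape listed for the Parser protocol; the Document and ParsedDoc dataclasses are simplified stand-ins for the Pydantic models in core, and the PARSERS table and get_parser function are hypothetical illustrations rather than the actual registry code.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class Document:
    """Simplified stand-in for the core Document model."""
    path: str
    content: str


@dataclass
class ParsedDoc:
    """Simplified stand-in for the core ParsedDoc model."""
    text: str


@runtime_checkable
class Parser(Protocol):
    """The slot contract, roughly as it might appear in parsers/base.py."""

    def parse(self, doc: Document) -> ParsedDoc: ...


class PlainTextParser:
    """Satisfies Parser structurally: it never inherits from the protocol."""

    def parse(self, doc: Document) -> ParsedDoc:
        return ParsedDoc(text=doc.content.strip())


# Name -> class table, loosely modeled on the idea behind registry/builtins.py.
PARSERS: dict[str, type] = {"plaintext": PlainTextParser}


def get_parser(name: str) -> Parser:
    """Resolve a config value such as parser = "plaintext" to an instance."""
    try:
        cls = PARSERS[name]
    except KeyError:
        raise LookupError(f"unknown parser {name!r}") from None
    return cls()


if __name__ == "__main__":
    parser = get_parser("plaintext")
    # The runtime check only verifies that a parse attribute exists;
    # signatures are checked statically by a type checker such as mypy.
    print(isinstance(parser, Parser))
    print(parser.parse(Document("notes.txt", "  hello  ")))
```

Because the implementation imports nothing but the protocol’s input and output types, a new parser can be added without touching any sibling module.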
Why it’s shaped this way
The dependency graph is a DAG pointing inward: core/ is the sink that imports nothing internal, and every implementation depends only on its slot’s protocol and on core, never on a sibling or another slot. The registry/ is the only place allowed to import concrete implementation classes, and it does so lazily, so a missing extra never breaks an unrelated code path. This is what lets a bare install run offline while a pip install indx[all] adds heavy backends without touching the core.
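The lazy-import behavior described here can be sketched as a table of import strings that are resolved only on lookup. This is an illustration under stated assumptions, not the actual registry: BUILTINS and resolve are hypothetical names, json.JSONEncoder merely stands in for a zero-dependency backend, and the "heavy" backend and its extra are deliberately fictional.

```python
import importlib
from typing import Optional


class MissingDependencyError(ImportError):
    """Raised when a backend's optional extra is not installed."""


# (slot, name) -> ("module:attribute", extra). Targets are plain strings, so
# building this table imports nothing. All entries are illustrative only.
BUILTINS: dict[tuple[str, str], tuple[str, Optional[str]]] = {
    ("store", "jsonl"): ("json:JSONEncoder", None),  # stdlib stand-in
    ("store", "heavy"): ("indx_fictional_heavy_backend:HeavyStore", "heavy"),
}


def resolve(slot: str, name: str) -> type:
    """Import the class behind (slot, name) only when it is requested."""
    try:
        target, extra = BUILTINS[(slot, name)]
    except KeyError:
        raise LookupError(f"unknown {slot} {name!r}") from None

    module_name, _, attr = target.partition(":")
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Turn a raw import failure into one actionable message.
        hint = f"pip install 'indx[{extra}]'" if extra else "check your installation"
        raise MissingDependencyError(
            f"{slot} {name!r} needs an optional dependency; try: {hint}"
        ) from exc
    return getattr(module, attr)


if __name__ == "__main__":
    print(resolve("store", "jsonl"))      # resolves; the missing backend is never touched
    try:
        resolve("store", "heavy")
    except MissingDependencyError as err:
        print(err)
```

Resolving the first entry succeeds even though the second entry’s module does not exist, which is the property that keeps a missing extra from affecting unrelated slots.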
Where to go next
- Dependencies: the runtime core, optional extras, and what each one pulls in.
- Design principles: the values (protocol-first, dependency-light, fail-loud, deterministic) behind these rules.
- Contributing overview: how to add a backend or stage within this layout.