FAQ

Short answers to the questions people ask most about indx. The throughline: indx turns a directory into an AI-ready knowledge space by composing the tools you already use — it does not try to replace them. For deeper dives, follow the links into the rest of the docs.

What is indx, in one sentence?

indx is an open-source Python CLI and SDK that turns a directory of documents into a portable, queryable knowledge space — structure, relationships, and semantic metadata that agents and RAG systems can reason over — in one command. See What is indx and the concepts overview.

What it is not

Is indx a parser?

No. indx is not a parser, and it does not try to build a better PDF or DOCX extractor. It composes parsers — Docling (the default), Unstructured, LlamaParse, MarkItDown, or your own — behind a single Parser slot, then adds the directory-level structure that file parsers throw away: folder lineage, file-to-file relationships, document types, topics, summaries, and a chunk graph.

Does it replace my vector database?

No. indx is not a vector database. It writes to one — Qdrant (default), pgvector, Chroma, LanceDB, or a zero-dependency jsonl store — through the Store slot, and it does not store or serve vectors at runtime. Pick the right backend in Choosing a store.

Is indx a hosted retrieval or agent runtime?

No. indx builds the knowledge space; serving, retrieval at scale, and agent orchestration are downstream concerns. The .indx archive it produces is the handoff point — load it into your own runtime, or feed its outputs to LangChain / LlamaIndex. indx is also not an OCR/embedding model vendor (it calls models through adapters, it does not train them) and not a general ETL framework.

Privacy, offline & data egress

Does it work offline? Does it send my data anywhere?

Yes, it works offline, and in an offline configuration nothing is sent off your machine. There are two distinct offline configurations, and both run with no network egress:

Local profile (opt-in) — a local database: parser=docling, llm=ollama:qwen2.5, vlm=none, embedder=bge-m3 (dim 1024), store=qdrant running in its embedded local mode, output=.indx. This is the recommended offline stack when you want ANN performance; opt in with indx[local].
Fully air-gapped, no-DB — no database at all: swap the store for jsonl (--store jsonl), which inlines vectors into the archive so nothing needs to be installed or served. Ideal for truly air-gapped hosts or small corpora.

The general zero-config defaults use cloud-backed OpenAI model components, so choose the local profile (or the no-DB variant) when no data may leave your machine. There is no telemetry by default; any telemetry would be strictly opt-in.

Data leaves your network only when a cloud component is part of the pipeline: either the zero-config OpenAI defaults or one you name explicitly (e.g. --llm openai, a LlamaParse parser, or a Cohere embedder). For both recipes (the embedded-DB local profile and the no-database run) see Local & air-gapped and, for the store trade-off, Choosing a store.
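As a rough illustration of the rule above, the sketch below flags which slots in a configuration would reach a cloud service. It is not part of indx: the `cloud_slots` helper, the dict layout, and the `CLOUD_COMPONENTS` set are assumptions made for this example, and the set lists only the cloud components this FAQ names.

```python
# Hypothetical helper, not an indx API: flag slots whose component is cloud-backed.
# The names below are only the cloud examples this FAQ mentions; extend for your stack.
CLOUD_COMPONENTS = {"openai", "llamaparse", "cohere"}


def cloud_slots(config: dict[str, str]) -> list[str]:
    """Return the slot names whose component would send data off-machine."""
    flagged = []
    for slot, component in config.items():
        # Component strings may carry a model suffix, e.g. "ollama:qwen2.5".
        backend = component.split(":", 1)[0].lower()
        if backend in CLOUD_COMPONENTS:
            flagged.append(slot)
    return sorted(flagged)


# The local profile as described in this FAQ, written as a plain dict.
local_profile = {
    "parser": "docling",
    "llm": "ollama:qwen2.5",
    "vlm": "none",
    "embedder": "bge-m3",
    "store": "qdrant",
}

# The same profile with two cloud components swapped in.
mixed = {**local_profile, "llm": "openai", "embedder": "cohere"}

print(cloud_slots(local_profile))
print(cloud_slots(mixed))
```

A check like this could run in CI before a build on a restricted host, failing the job if the returned list is not empty.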

License, cost & versions

What does it cost? What’s the license?

indx is free and open-source under Apache-2.0 — fine for commercial and on-prem use. Install with:

pip install indx

The repository lives at github.com/indx/indx. Install details and optional extras are in Installation.

Which Python versions are supported?

Python 3.11+, specifically tested on 3.11, 3.12, and 3.13. The 3.11 floor lets indx rely on stdlib tomllib, modern typing primitives, and ExceptionGroup/except* without backports. 3.10 and earlier are not supported.
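If you wrap indx in your own tooling, a small guard can fail early on an unsupported interpreter. This is a minimal sketch, not something indx ships; `MIN_PYTHON` and `is_supported` are names invented for the example.

```python
import sys

# Floor stated in this FAQ; the constant and helper names are hypothetical.
MIN_PYTHON = (3, 11)


def is_supported(version=tuple(sys.version_info[:2])) -> bool:
    """Return True when the (major, minor) version meets the 3.11 floor."""
    return tuple(version[:2]) >= MIN_PYTHON


if not is_supported():
    # Report rather than exit, so the snippet is safe to paste into a REPL.
    print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, found {sys.version.split()[0]}")
```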

Output & archives

What’s the difference between the `.indx` archive and the expanded output?

A build (indx ./docs --out ./ai-ready) writes both forms side by side:

Form	What it is	When to use it
`handbook.indx`	A single self-contained ZIP archive (manifest, `index.json`, `chunks/`, `embeddings/`)	Ship, diff, move between machines, or `load()` without re-processing
Expanded `./ai-ready/`	The same data unpacked: `index.json`, `chunks/`, `embeddings/`	Inspect or stream the files directly with ordinary tooling

They carry the same knowledge graph — the archive is the portable, sealed version. See the .indx archive reference and the index.json reference.
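Because the archive is described above as a ZIP file, ordinary ZIP tooling should be able to list its members. The sketch below uses only the standard library; `list_archive` is a hypothetical helper, and the exact member names inside a real `.indx` file are not assumed here.

```python
import zipfile


def list_archive(path: str) -> list[str]:
    """Return the sorted member names of a ZIP-based archive."""
    with zipfile.ZipFile(path) as zf:
        return sorted(zf.namelist())
```

Calling `list_archive("handbook.indx")` on a built archive would then show whatever the build wrote, which is a quick way to confirm that the archive and the expanded directory line up.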

Tuning the pipeline

Can I skip the LLM enrichment?

Yes — two equivalent ways. From the CLI, pass --llm none. From the SDK, drop the stage entirely:

from indx import DirectoryPipeline

space = (
    DirectoryPipeline(embedder="bge-m3", store="jsonl")
    .drop("enrich")          # no topics/tags/summaries; everything else still runs
    .run("./docs", "./out")
)

You still get walk, parse, chunk, relate, and embed+pack. More on enrichment in Enrichment with LLM & VLM.

Can I run with no database at all?

Yes. Use the jsonl store (--store jsonl), which performs a brute-force scan and inlines vectors so the .indx archive is fully self-contained: no database installed, no server running. It is ideal for air-gapped hosts or small corpora; for non-trivial spaces the default Qdrant ANN index is faster. Details in Choosing a store and Local & air-gapped.
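To make "brute-force scan" concrete, here is a minimal sketch of the technique: score every stored vector against the query and keep the top k. The record layout (`id` and `vector` keys) and the function names are assumptions for illustration, not the format indx actually writes.

```python
import json
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def brute_force_search(jsonl_lines, query, k=3):
    """Score every record against the query and return the k best (score, id) pairs."""
    scored = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)  # hypothetical layout: {"id": ..., "vector": [...]}
        scored.append((cosine(query, record["vector"]), record["id"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Three toy records standing in for lines of a jsonl file.
toy = [
    '{"id": "a", "vector": [1.0, 0.0]}',
    '{"id": "b", "vector": [0.0, 1.0]}',
    '{"id": "c", "vector": [1.0, 1.0]}',
]

top = brute_force_search(toy, [1.0, 0.0], k=2)
```

The cost grows linearly with the number of vectors, which is why this approach suits small corpora and an ANN index suits larger ones.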

How do I add my own parser, store, or embedder?

Every slot is a typed Protocol, so an implementation needs no base class and no fork. Two paths:

Bring-your-own (in code): pass any object satisfying the protocol to the constructor or .use(...). See Bring your own stack and Custom components.
Plugins (packaged): publish a distribution that advertises a Python entry point under groups like indx.parsers, indx.stores, indx.embedders; indx discovers it at runtime so backend = "myparser" just works. See Authoring a plugin.

You can also add a whole pipeline step — see Custom stage.

Ecosystem & scale

How is indx different from LangChain or LlamaIndex?

They are output targets and consumers, not competitors. indx focuses narrowly on turning a directory into a structured knowledge space; LangChain and LlamaIndex are application frameworks that consume that knowledge to build chains and agents. indx ships langchain and llamaindex output writers precisely so the space you build feeds straight into them. Because every component is swappable, indx also works as a neutral migration layer between stacks. See Output formats and Use cases.

How big a directory can it handle?

indx is designed for 10k+ files. It streams and processes incrementally rather than loading the whole estate into memory, parses files in parallel, and batches embedding and store upserts (the single biggest performance lever). Concurrency and batch sizes are tunable per backend. See Performance.
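Streaming plus batching, as described above, can be illustrated with a small generator that groups any iterable into fixed-size lists without materialising the whole input. This is a generic sketch of the technique, not indx's internal code; the `batched` name and signature are chosen for the example.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of up to `size` items, pulling lazily from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice consumes at most `size` items per pass; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch
```

Each yielded batch could then be handed to a single embedding call or a single store upsert, so memory use is bounded by the batch size rather than by the corpus.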

Are runs reproducible?

Given the same inputs, config, and component versions, runs are reproducible: chunk and document ids are assigned in a deterministic traversal order, enrichment defaults to temperature=0.0, and the resolved config plus the embedder’s name/dim are recorded in the manifest. Inherently non-deterministic LLM steps are flagged. See Reproducibility.
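One way to get ids that depend only on traversal order is to sort the relative paths before numbering them. The sketch below shows that idea under stated assumptions: the `doc-000000` id format and the `deterministic_doc_ids` helper are invented for illustration and are not indx's actual scheme.

```python
from pathlib import Path


def deterministic_doc_ids(root) -> dict[str, str]:
    """Assign sequential ids to files under root in sorted relative-path order."""
    root = Path(root)
    # Sorting POSIX-style relative paths removes any dependence on filesystem listing order.
    rel_paths = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    # Hypothetical id format, chosen for this example only.
    return {rel: f"doc-{i:06d}" for i, rel in enumerate(rel_paths)}
```

Running the function twice on an unchanged directory returns the same mapping, while adding or removing a file shifts the ids that follow it in sort order.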

Still stuck?

Get running fast: Quickstart and Your first knowledge space.
Inspect and query a built archive: Inspect & query.
Full command and API surfaces: CLI reference and SDK reference.
Found a bug or want a feature? Open an issue at github.com/indx/indx and see Contributing.