FAQ
Short answers to the questions people ask most about indx. The throughline: indx turns a directory into an AI-ready knowledge space by composing the tools you already use — it does not try to replace them. For deeper dives, follow the links into the rest of the docs.
What is indx, in one sentence?
indx is an open-source Python CLI and SDK that turns a directory of documents into a portable, queryable knowledge space (structure, relationships, and semantic metadata that agents and RAG systems can reason over) in one command. See What is indx and the concepts overview.
What it is not
Is indx a parser?
No. indx is not a parser, and it does not try to build a better PDF or DOCX extractor. It composes parsers (Docling by default, Unstructured, LlamaParse, MarkItDown, or your own) behind a single Parser slot, then adds the directory-level structure that file parsers throw away: folder lineage, file-to-file relationships, document types, topics, summaries, and a chunk graph.
Does it replace my vector database?
No. indx is not a vector database. It writes to one (Qdrant by default, pgvector, Chroma, LanceDB, or a zero-dependency jsonl store) through the Store slot, and it does not store or serve vectors at runtime. Pick the right backend in Choosing a store.
Is indx a hosted retrieval or agent runtime?
No. indx builds the knowledge space; serving, retrieval at scale, and agent orchestration are downstream concerns. The .indx archive it produces is the handoff point: load it into your own runtime, or feed its outputs to LangChain / LlamaIndex. indx is also not an OCR/embedding model vendor (it calls models through adapters and does not train them) and not a general ETL framework.
Privacy, offline & data egress
Does it work offline? Does it send my data anywhere?
Yes. There are two distinct offline configurations, and both run with no network egress:
- Local profile (opt-in), backed by a local database: parser=docling, llm=ollama:qwen2.5, vlm=none, embedder=bge-m3 (dim 1024), store=qdrant running in its embedded local mode, output=.indx. This is the recommended offline stack when you want ANN performance; opt in with indx[local].
- Fully air-gapped, no-DB variant, with no database at all: swap the store for jsonl (--store jsonl), which inlines vectors into the archive so nothing needs to be installed or served. Ideal for truly air-gapped hosts or small corpora.
The general zero-config defaults use cloud-backed OpenAI model components, so choose the local profile (or the no-DB variant) when no data may leave your machine. There is no telemetry by default; any telemetry would be strictly opt-in.
Data leaves your network only if you explicitly name a cloud component (e.g. --llm openai, a LlamaParse parser, or a Cohere embedder). For both recipes — the embedded-DB local profile and the no-database run — see Local & air-gapped and, for the store trade-off, Choosing a store.
License, cost & versions
What does it cost? What’s the license?
indx is free and open-source under Apache-2.0, so it is fine for commercial and on-prem use. Install with:

    pip install indx

The repository lives at github.com/indx/indx. Install details and optional extras are in Installation.
Which Python versions are supported?
Python 3.11+, specifically tested on 3.11, 3.12, and 3.13. The 3.11 floor lets indx rely on stdlib tomllib, modern typing primitives, and ExceptionGroup/except* without backports. 3.10 and earlier are not supported.
Output & archives
What’s the difference between the .indx archive and the expanded output?
A build (indx ./docs --out ./ai-ready) writes both forms side by side:
| Form | What it is | When to use it |
|---|---|---|
| handbook.indx | A single self-contained ZIP archive (manifest, index.json, chunks/, embeddings/) | Ship, diff, move between machines, or load() without re-processing |
| Expanded ./ai-ready/ | The same data unpacked: index.json, chunks/, embeddings/ | Inspect or stream the files directly with ordinary tooling |
They carry the same knowledge graph — the archive is the portable, sealed version. See the .indx archive reference and the index.json reference.
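Because the archive is described as a plain ZIP, ordinary tooling such as Python's zipfile module should be able to open it. The sketch below builds a toy archive in memory and reads it back; the member names (manifest.json, chunks/0001.json, and so on) and the JSON payloads are assumptions for illustration, not the documented .indx layout.

```python
import io
import json
import zipfile

# Build a toy archive in memory. The member names and payloads are assumed
# for illustration; consult the .indx archive reference for the real layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("manifest.json", json.dumps({"embedder": "bge-m3", "dim": 1024}))
    archive.writestr("index.json", json.dumps({"documents": ["intro.md"]}))
    archive.writestr("chunks/0001.json", json.dumps({"text": "hello"}))
    archive.writestr("embeddings/0001.json", json.dumps([0.1, 0.2]))

# Read it back the way any ZIP-aware tool would.
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    members = sorted(archive.namelist())
    manifest = json.loads(archive.read("manifest.json"))
    top_level = sorted({name.split("/")[0] for name in members})
```

For a real archive on disk, passing its path to zipfile.ZipFile in place of the buffer would be the equivalent starting point.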
Tuning the pipeline
Can I skip the LLM enrichment?
Yes, in two equivalent ways. From the CLI, pass --llm none. From the SDK, drop the stage entirely:

    from indx import DirectoryPipeline

    space = (
        DirectoryPipeline(embedder="bge-m3", store="jsonl")
        .drop("enrich")  # no topics/tags/summaries; everything else still runs
        .run("./docs", "./out")
    )

You still get walk, parse, chunk, relate, and embed+pack. More on enrichment in Enrichment with LLM & VLM.
Can I run with no database at all?
Yes. Use the jsonl store (--store jsonl), which performs a brute-force scan and inlines vectors so the .indx archive is fully self-contained: no database installed, no server running. It is ideal for air-gapped hosts or small corpora; for non-trivial spaces the default Qdrant ANN index is faster. Details in Choosing a store and Local & air-gapped.
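To make "brute-force scan" concrete, here is a minimal sketch of the general technique: score every stored vector against the query and keep the best few. The record fields (id, vector) and the use of cosine similarity are assumptions; the actual jsonl store format and scoring may differ.

```python
import io
import json
import math

# Hypothetical JSONL records with inline vectors; the field names are assumed.
JSONL = io.StringIO(
    '{"id": "a", "vector": [1.0, 0.0]}\n'
    '{"id": "b", "vector": [0.0, 1.0]}\n'
    '{"id": "c", "vector": [0.7, 0.7]}\n'
)


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def brute_force_search(lines, query: list[float], k: int = 2) -> list[tuple[str, float]]:
    """Score every record against the query, then keep the k best."""
    scored = []
    for line in lines:
        record = json.loads(line)
        scored.append((record["id"], cosine(query, record["vector"])))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


hits = brute_force_search(JSONL, query=[1.0, 0.1])
```

Every query touches every record, so cost grows linearly with corpus size. That is why this approach suits small corpora and an ANN index suits larger ones.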
How do I add my own parser, store, or embedder?
Every slot is a typed Protocol, so an implementation needs no base class and no fork. Two paths:
- Bring-your-own (in code): pass any object satisfying the protocol to the constructor or .use(...). See Bring your own stack and Custom components.
- Plugins (packaged): publish a distribution that advertises a Python entry point under groups like indx.parsers, indx.stores, indx.embedders; indx discovers it at runtime so backend = "myparser" just works. See Authoring a plugin.
You can also add a whole pipeline step — see Custom stage.
Ecosystem & scale
How is indx different from LangChain or LlamaIndex?
They are output targets and consumers, not competitors. indx focuses narrowly on turning a directory into a structured knowledge space; LangChain and LlamaIndex are application frameworks that consume that knowledge to build chains and agents. indx ships langchain and llamaindex output writers precisely so the space you build feeds straight into them. Because every component is swappable, indx also works as a neutral migration layer between stacks. See Output formats and Use cases.
How big a directory can it handle?
indx is designed for 10k+ files. It streams and processes incrementally rather than loading the whole estate into memory, parses files in parallel, and batches embedding and store upserts (the single biggest performance lever). Concurrency and batch sizes are tunable per backend. See Performance.
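The batching idea can be sketched generically: pull items lazily from a stream and hand them to the embedder in fixed-size groups, so the number of calls drops while memory stays bounded. The in_batches helper and the fake_embed function are hypothetical illustrations, not part of indx.

```python
from collections.abc import Iterable, Iterator
from itertools import islice


def in_batches(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield lists of at most `size` items without materializing the input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def fake_embed(batch: list[str]) -> list[list[float]]:
    """Stand-in for one embedder call; a real one would invoke a model."""
    return [[float(len(text))] for text in batch]


calls = 0
vectors: list[list[float]] = []
chunks = (f"chunk-{i}" for i in range(10))  # a generator, consumed lazily
for batch in in_batches(chunks, size=4):
    calls += 1
    vectors.extend(fake_embed(batch))
```

Ten chunks with a batch size of four take three calls instead of ten. The same pattern applies to store upserts.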
Are runs reproducible?
Given the same inputs, config, and component versions, runs are reproducible: chunk and document ids are assigned in a deterministic traversal order, enrichment defaults to temperature=0.0, and the resolved config plus the embedder’s name/dim are recorded in the manifest. Inherently non-deterministic LLM steps are flagged. See Reproducibility.
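One way to get a deterministic traversal order, shown here purely as an assumption about the general technique, is to sort relative paths before assigning ids. The doc-0001 id scheme and the assign_ids helper are hypothetical; indx's actual id format is not specified here.

```python
import tempfile
from pathlib import Path


def assign_ids(root: Path) -> dict[str, str]:
    """Map each file's relative path to an id based on sorted traversal order."""
    files = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    return {rel: f"doc-{i:04d}" for i, rel in enumerate(files, start=1)}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Create files in a scrambled order; the ids should not depend on it.
    for rel in ["b/two.md", "a/one.md", "c.md"]:
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("x")
    first = assign_ids(root)
    second = assign_ids(root)
```

Sorting matters because raw directory listings come back in an order that depends on the filesystem, so unsorted traversal can assign different ids on different machines.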
Still stuck?
- Get running fast: Quickstart and Your first knowledge space.
- Inspect and query a built archive: Inspect & query.
- Full command and API surfaces: CLI reference and SDK reference.
- Found a bug or want a feature? Open an issue at github.com/indx/indx and see Contributing.