What is indx?
indx turns a directory of source documents into an AI-ready knowledge space — structure, relationships, and semantic metadata that AI agents and RAG systems can reason over — in a single command. It is open-source (Apache-2.0), written in Python, and shipped as both a CLI and an SDK.

    pip install indx
    indx ./docs --out ./ai-ready

The one-line thesis
Make directories AI-ready, not just files.
Today’s document-AI tooling is overwhelmingly file-level. Parsers like Docling, Unstructured, LlamaParse, and MarkItDown are excellent at answering “what does this one file say?” — they extract clean text, tables, and layout from a PDF or DOCX.
But in the act of flattening each file to text, they discard everything around the file: the folder it lived in, the documents it sits next to, the report it continues, the contract it references, and what kind of document it is.
indx is built on the file → directory thesis: most real knowledge does not live in a single file — it lives in the arrangement of files. A directory is a graph of related documents with implicit hierarchy and document-type semantics. indx operates at the directory level: it walks the tree, parses every file with the engine you choose, then layers on the things parsers throw away — folder lineage, file-to-file relationships, document-type awareness, topics, summaries, and a chunk graph — and emits a portable, queryable knowledge space.
The result: agents stop seeing a bag of disconnected text fragments and start seeing an organized, navigable knowledge estate.
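As a rough illustration of what "folder lineage" means, the sketch below walks a throwaway directory tree with the standard library and records, for each file, the chain of folders above it. The function name `folder_lineage` and the sample tree are assumptions made for this example; this is not indx's implementation or API.

```python
import tempfile
from pathlib import Path


def folder_lineage(root: Path) -> dict[str, list[str]]:
    """Map each file's root-relative path to the chain of folders above it."""
    lineage = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root)
            # rel.parts[:-1] is every folder between the root and the file.
            lineage[rel.as_posix()] = list(rel.parts[:-1])
    return lineage


# Build a tiny throwaway tree so the example runs anywhere (hypothetical files).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "contracts" / "2024" / "acme").mkdir(parents=True)
    (root / "contracts" / "2024" / "acme" / "msa.txt").write_text("Master agreement")
    (root / "handbook").mkdir()
    (root / "handbook" / "onboarding.txt").write_text("Welcome")
    lineage = folder_lineage(root)

for file_path, folders in lineage.items():
    print(file_path, "<-", folders)
```

A file-level parser would see only the text of `msa.txt`; the lineage list is the part that says it is a 2024 contract belonging to one counterparty.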
The problem
Modern RAG and agent stacks ingest documents by running a file parser, splitting the output into chunks, embedding the chunks, and dumping them into a vector store. This parse → split → embed → store pipeline is lossy in ways that directly hurt retrieval and reasoning:
- Structure is destroyed. A folder tree carries meaning (/contracts/2024/acme/, /handbook/onboarding/). Once each file is flattened to text, the path and hierarchy are gone.
- Relationships are invisible. “Section 2 continues in the next file,” “this invoice references that PO,” “these two files are near-duplicates”: none of this survives file-by-file parsing.
- Document type is ignored. A meeting transcript, a legal contract, and an API reference need different handling, but file parsers treat them as undifferentiated text.
- Output is not portable. The enriched knowledge ends up locked inside whichever vector DB or framework was used, with no self-contained artifact to ship, diff, or migrate.
- Everything is hand-wired. Teams glue together parser + splitter + embedder + store themselves, and re-glue it every time a component changes.
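To make the "relationships are invisible" point concrete, here is a minimal sketch of one cross-file signal that only exists when documents are compared with each other: near-duplicate detection using Jaccard similarity over word shingles. The technique, the function names, the threshold, and the sample texts are all assumptions chosen for illustration; the source does not say how indx detects duplicates.

```python
from itertools import combinations


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams in `text`."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |intersection| / |union| (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicates(docs: dict[str, str], threshold: float = 0.5):
    """Yield (path_a, path_b, score) for every pair at or above `threshold`."""
    sets = {path: shingles(text) for path, text in docs.items()}
    for left, right in combinations(sorted(sets), 2):
        score = jaccard(sets[left], sets[right])
        if score >= threshold:
            yield left, right, score


# Hypothetical already-parsed documents, keyed by path.
docs = {
    "policies/retention_v1.txt": "customer data is retained for ninety days after account closure",
    "policies/retention_v2.txt": "customer data is retained for thirty days after account closure",
    "handbook/onboarding.txt": "welcome to the team here is your first week schedule",
}
pairs = list(near_duplicates(docs, threshold=0.3))
for left, right, score in pairs:
    print(f"{left} ~ {right} (score={score:.2f})")
```

No single-file parse can produce this pair, because the signal lives between the two files rather than inside either one.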
File-level tools vs. directory-level indx
The shift indx makes is moving the unit of work from a single file to a whole document estate. The dimensions below are the ones that matter most in practice:
| Dimension | File-level parsers (Docling, Unstructured, LlamaParse, MarkItDown) | indx (directory-level) |
|---|---|---|
| Unit of work | A single file | A whole directory / document estate |
| Core question answered | “What does this file say?” | “How does this body of knowledge fit together?” |
| Folder structure | Discarded | Preserved as Document folder lineage |
| File-to-file relationships | None | Typed Relation graph (sibling, parent, references, continues, duplicate-of) |
| Document-type awareness | None / generic | Detected per Document; drives enrichment |
| Semantic metadata | Minimal (raw text + layout) | Topics, tags, summaries, references per document |
| Chunking | Often built-in, isolated | Chunks carry source doc, position, and neighbor links |
| Output | Text / Markdown / JSON per file | Portable KnowledgeSpace → .indx archive |
| Relationship to parsers | Is the parser | Composes parsers; not a replacement |
To learn the vocabulary behind that table — KnowledgeSpace, Document, Chunk, Relation — see Core objects.
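For readers who think in data structures, the right-hand column of the table can be sketched as a few plain dataclasses. Every class, field, and helper name here is hypothetical and chosen only to mirror the table (folder lineage, document type, chunk position and neighbor links, typed relations); the real indx objects may be shaped quite differently.

```python
from dataclasses import dataclass, field
from typing import Optional

# Relation types as listed in the comparison table.
RELATION_KINDS = {"sibling", "parent", "references", "continues", "duplicate-of"}


@dataclass
class Document:
    path: str
    doc_type: str
    lineage: list[str] = field(default_factory=list)  # folders above the file
    topics: list[str] = field(default_factory=list)
    summary: str = ""


@dataclass
class Chunk:
    doc_path: str  # source document
    position: int  # order within that document
    text: str
    prev_position: Optional[int] = None  # neighbor links
    next_position: Optional[int] = None


@dataclass(frozen=True)
class Relation:
    source: str
    target: str
    kind: str

    def __post_init__(self):
        if self.kind not in RELATION_KINDS:
            raise ValueError(f"unknown relation kind: {self.kind}")


def link_chunks(doc_path: str, pieces: list[str]) -> list[Chunk]:
    """Turn a list of text pieces into chunks that know their neighbors."""
    last = len(pieces) - 1
    return [
        Chunk(doc_path, i, text,
              prev_position=i - 1 if i > 0 else None,
              next_position=i + 1 if i < last else None)
        for i, text in enumerate(pieces)
    ]


doc = Document("contracts/2024/acme/msa.txt", "contract",
               lineage=["contracts", "2024", "acme"])
chunks = link_chunks(doc.path, ["Clause 1", "Clause 2", "Clause 3"])
rel = Relation("contracts/2024/acme/sow.txt", doc.path, "references")
print(chunks[1], rel)
```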
indx composes parsers — it does not replace them
This is the single most important thing to internalize: indx does not compete with parsers, it consumes them. A file parser is just one swappable component slot inside indx. You pick the parser that’s best for your files (Docling by default), and indx orchestrates everything else around it.
Concretely, indx is not several things it might be mistaken for:
| indx is NOT… | …because |
|---|---|
| a parser | It composes Docling / Unstructured / LlamaParse / MarkItDown; it will not build a better PDF extractor. |
| a vector database | It writes to Qdrant / pgvector / Chroma / LanceDB / JSONL; it does not store or serve vectors at runtime. |
| a hosted retrieval / agent runtime | It builds the knowledge space; serving and orchestration are downstream concerns. |
| an OCR / CV / embedding-model vendor | It calls models through adapters; it does not train them. |
| a general ETL / data-pipeline framework | Its scope is unstructured document directories → knowledge spaces. |
Because every slot has a zero-dependency fallback that ships in the core — a plaintext parser, a jsonl store, none for the VLM, and .indx + jsonl writers — a complete run works offline and air-gapped out of the box. That makes indx viable in regulated or disconnected environments where SaaS parsers and cloud LLMs are non-starters. See the local & air-gapped guide.
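One way to picture a "slot with a zero-dependency fallback" is a small interface plus a registry lookup that defaults to a built-in implementation. The sketch below uses `typing.Protocol` for the interface. `Parser`, `PlaintextParser`, `UppercaseParser`, and `resolve_parser` are hypothetical names for this illustration, and whether indx itself falls back silently or raises an error is not stated in the source.

```python
from typing import Protocol


class Parser(Protocol):
    """Anything with a parse(bytes) -> str method fits the slot."""

    def parse(self, raw: bytes) -> str: ...


class PlaintextParser:
    """Zero-dependency fallback: decode bytes as UTF-8 text."""

    def parse(self, raw: bytes) -> str:
        return raw.decode("utf-8", errors="replace")


class UppercaseParser:
    """Stand-in for an optional, heavier parser plugged into the same slot."""

    def parse(self, raw: bytes) -> str:
        return raw.decode("utf-8", errors="replace").upper()


def resolve_parser(name: str, registry: dict[str, Parser]) -> Parser:
    """Return the requested parser, or the plaintext fallback if it is absent."""
    return registry.get(name, PlaintextParser())


registry = {"upper": UppercaseParser()}
chosen = resolve_parser("upper", registry)
fallback = resolve_parser("some-pdf-parser", registry)  # not registered here

print(chosen.parse(b"hello"), fallback.parse(b"hello"))
```

Because the fallback needs only the standard library, the lookup never has to reach the network, which is the property that matters in an air-gapped environment.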
How a run is structured
Under the hood, indx is a pipeline of six ordered, individually-replaceable stages that share a single SpaceContext:

    01 Walk → 02 Parse → 03 Chunk → 04 Relate → 05 Enrich → 06 Embed+Pack

Each stage receives and returns the same mutated SpaceContext, so you can insert, swap, or drop a stage without touching its neighbors. The pipeline is a list you control, not a black box. The pipeline overview walks through each stage in detail.
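The "list of stages sharing one context" idea can be sketched in a few lines. This is a toy model under stated assumptions: `Context`, `walk`, `chunk`, and `run` are hypothetical stand-ins, only two of the six stages are mocked, and the data is hard-coded. It shows the pattern, not indx's actual SpaceContext or stage signatures.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Context:
    """Hypothetical stand-in for the shared state passed between stages."""
    files: dict[str, str] = field(default_factory=dict)
    chunks: list[tuple[str, int, str]] = field(default_factory=list)
    log: list[str] = field(default_factory=list)


Stage = Callable[[Context], Context]


def walk(ctx: Context) -> Context:
    # A real stage would read the directory; here the files are hard-coded.
    ctx.files = {"a.txt": "one two three four", "b.txt": "five six"}
    ctx.log.append("walk")
    return ctx


def chunk(ctx: Context) -> Context:
    # Split each file into two-word chunks tagged with (path, position).
    for path, text in ctx.files.items():
        words = text.split()
        for pos in range(0, len(words), 2):
            ctx.chunks.append((path, pos // 2, " ".join(words[pos:pos + 2])))
    ctx.log.append("chunk")
    return ctx


def run(stages: list[Stage], ctx: Optional[Context] = None) -> Context:
    """Pass one context through each stage in order."""
    if ctx is None:
        ctx = Context()
    for stage in stages:
        ctx = stage(ctx)
    return ctx


result = run([walk, chunk])
walk_only = run([walk])  # dropping a stage is just editing the list
print(result.log, len(result.chunks), len(walk_only.chunks))
```

Since every stage has the same signature, inserting or swapping one never requires changes to its neighbors.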
The whole thing is symmetric across the CLI and SDK, so you keep one mental model whether you live in a shell or a notebook:

    from indx import DirectoryPipeline

    # Runs end to end with default components (cloud; needs OPENAI_API_KEY)
    space = DirectoryPipeline().run("./docs", "./ai-ready")

    print(space.stats)  # counts, timings, components used
    for doc in space.documents(type="contract"):
        print(doc.path, doc.topics, doc.summary)

    hits = space.search("data retention", k=5)

Who is it for?
indx is built for people working with messy, real repositories rather than single files:
- RAG / agent engineers who want grounded context with relationships, not a flat chunk soup.
- Enterprise / on-prem platform teams in regulated or air-gapped environments that need fully local, auditable, reproducible ingestion.
- OSS developers and integrators who want a composable, no-lock-in library they can extend with their own parser, store, or output.
- Researchers turning large archives of papers, datasets, and notes into a navigable, citable, shareable knowledge space.
See use cases for the full picture of where indx fits.