indx

Parsers turn one PDF into clean text. indx turns an entire folder into a knowledge space — with structure, relationships, and semantic metadata that AI agents and RAG systems can actually reason over.


pip install indx
indx ./docs --out ./ai-ready

The file → directory thesis

A great parser answers “what does this file say?” An agent searching your knowledge base is asking something harder: “where does this belong, what does it relate to, and what context do I need to trust it?” A folder is not a bag of files — it has a shape. Most tooling throws that shape away. indx keeps the map, and hands it to the agent.

Directory-level

The unit of work is the directory, not the file. Nested trees, ZIPs, and mixed formats flow through one pipeline into one coherent knowledge space.

Relationship-aware

Folder hierarchy, sibling files, and cross-document references become a typed graph, so an agent can treat a path like /contracts/2024/ as context rather than as an opaque string.
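As a rough illustration of how folder shape can become typed edges, here is a minimal sketch in plain Python. The `Relation` class, the `child_of` and `sibling_of` kinds, and the `folder_relations` helper are hypothetical names chosen for this example; they are not taken from indx's API, and the real graph may be modeled differently.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Relation:
    # Hypothetical edge type for illustration; not indx's actual data model.
    source: str
    kind: str  # "child_of" or "sibling_of"
    target: str


def folder_relations(paths: list[str]) -> list[Relation]:
    """Derive typed edges from nothing but the shape of the paths."""
    relations: list[Relation] = []
    by_parent: dict[str, list[str]] = {}
    for p in sorted(paths):
        parent = str(PurePosixPath(p).parent)
        relations.append(Relation(p, "child_of", parent))
        by_parent.setdefault(parent, []).append(p)
    # Files sharing a folder are siblings; emit each unordered pair once.
    for siblings in by_parent.values():
        for i, a in enumerate(siblings):
            for b in siblings[i + 1:]:
                relations.append(Relation(a, "sibling_of", b))
    return relations


edges = folder_relations([
    "contracts/2024/acme.pdf",
    "contracts/2024/globex.pdf",
    "contracts/2023/acme.pdf",
])
for edge in edges:
    print(edge.source, edge.kind, edge.target)
```

Even this toy version separates the two 2024 contracts from the 2023 one without reading a single byte of file content, which is the information a per-file parser discards.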

Semantic metadata

Document type, topics, tags, and summaries are attached as metadata, so retrieval can filter and reason instead of guessing.

Portable output

A self-contained, versioned .indx archive — equally legible to a person and to an LLM context window. Build it once, ship it anywhere.

Bring your own stack

Parser, LLM, VLM, embedder, vector store, output — every slot is a typed interface with a sensible default. No lock-in, ever.
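One common way to express "every slot is a typed interface with a sensible default" in Python is `typing.Protocol`. The sketch below assumes a hypothetical `Embedder` protocol with a single `embed` method, plus two toy implementations; none of these names or signatures come from indx itself.

```python
from typing import Protocol, Sequence


class Embedder(Protocol):
    # Hypothetical slot interface; the real indx protocol may differ.
    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class CharCountEmbedder:
    """Toy default: a 2-dimensional vector of (length, vowel count)."""

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        return [
            [float(len(t)), float(sum(c in "aeiou" for c in t.lower()))]
            for t in texts
        ]


class ZeroEmbedder:
    """A drop-in replacement that satisfies the same protocol."""

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        return [[0.0, 0.0] for _ in texts]


def build_vectors(
    texts: Sequence[str], embedder: Embedder = CharCountEmbedder()
) -> list[list[float]]:
    # Any object with a matching embed() method fits the slot;
    # no inheritance from a base class is required.
    return embedder.embed(texts)


default_vectors = build_vectors(["indx"])
swapped_vectors = build_vectors(["a", "b"], ZeroEmbedder())
```

Because the protocol is structural, a third-party embedder only needs the right method shape to be swapped in, which is what keeps a slot free of lock-in.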

Local-first

The local profile runs fully offline — local parser, local LLM, local embedder, no-DB output. Air-gapped by default, not as an afterthought.

Start here

What is indx?

The problem, the thesis, and how indx composes parsers instead of replacing them.

Installation

Install the light core and the optional extras that power each component slot.

Quickstart

From pip install to a queryable knowledge space in 60 seconds.

Tutorial

Build, inspect, and query your first knowledge space end to end.

Go deeper

Core Concepts

KnowledgeSpace, Document, Chunk, Relation: the small mental model the whole API rests on.

The Pipeline

Walk → Parse → Chunk → Relate → Enrich → Embed+Pack, one replaceable stage at a time.

Guides

Configure components, run air-gapped, write custom stages, author plugins, and tune performance.

Reference

CLI, SDK, data models, protocols, the .indx format, and the index.json schema.
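To make the "one replaceable stage at a time" idea concrete, here is a minimal sketch of a staged pipeline over a tiny data model. It borrows the names KnowledgeSpace, Document, and Chunk from Core Concepts, but the fields, the `Stage` type, and the `run` helper are assumptions made for illustration, not indx's SDK. Only Walk, Parse, and Chunk are shown; Relate, Enrich, and Embed+Pack are left out, and an in-memory dict stands in for a real directory.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Chunk:
    doc_path: str
    text: str


@dataclass
class Document:
    path: str
    text: str = ""
    chunks: list[Chunk] = field(default_factory=list)


@dataclass
class KnowledgeSpace:
    documents: list[Document] = field(default_factory=list)


# Hypothetical stage signature: take a space, return the updated space.
Stage = Callable[[KnowledgeSpace], KnowledgeSpace]


def make_walk(files: dict[str, str]) -> Stage:
    def walk(space: KnowledgeSpace) -> KnowledgeSpace:
        # Stand-in for a directory walk: register one Document per path.
        space.documents = [Document(path=p) for p in sorted(files)]
        return space
    return walk


def make_parse(files: dict[str, str]) -> Stage:
    def parse(space: KnowledgeSpace) -> KnowledgeSpace:
        for doc in space.documents:
            doc.text = files[doc.path].strip()
        return space
    return parse


def chunk_by_paragraph(space: KnowledgeSpace) -> KnowledgeSpace:
    for doc in space.documents:
        doc.chunks = [Chunk(doc.path, part) for part in doc.text.split("\n\n") if part]
    return space


def chunk_whole_document(space: KnowledgeSpace) -> KnowledgeSpace:
    # Alternative Chunk stage: one chunk per document.
    for doc in space.documents:
        doc.chunks = [Chunk(doc.path, doc.text)]
    return space


def run(stages: list[Stage], space: Optional[KnowledgeSpace] = None) -> KnowledgeSpace:
    if space is None:
        space = KnowledgeSpace()
    for stage in stages:
        space = stage(space)
    return space


files = {"notes/a.txt": "first\n\nsecond\n", "notes/b.txt": "only"}
by_paragraph = run([make_walk(files), make_parse(files), chunk_by_paragraph])
# Swapping a single stage leaves the rest of the pipeline untouched.
whole = run([make_walk(files), make_parse(files), chunk_whole_document])
```

Keeping every stage to the same signature is the design choice that makes replacement cheap: the runner does not need to know which chunker, parser, or walker it was handed.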