Choosing a Parser
The Parser slot is stage 02 of the pipeline: it turns each file into a normalized ParsedDoc — text, structural blocks, tables, and image refs — that every later stage consumes. indx composes parsers; it does not replace them, so picking the right one is mostly about matching your corpus and your constraints (local-first, light core, no lock-in) to the strengths of each backend.
The default is Docling, the best all-round local choice. This guide explains the alternatives, how to select one, and which extra to install.
The options at a glance
Every parser implements the same Parser protocol (parse(file) -> ParsedDoc), so the pipeline never knows or cares which one is active. That means you can switch freely without touching anything downstream.
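To make the idea of a shared protocol concrete, here is a minimal sketch in Python. The `ParsedDoc` field names, the `Parser` protocol definition, and both toy parsers are assumptions made for illustration; they are not indx's actual classes. The point is only that a downstream function can work with any object exposing `parse(file) -> ParsedDoc`.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Protocol, runtime_checkable


@dataclass
class ParsedDoc:
    """Hypothetical stand-in for indx's ParsedDoc; field names are assumed."""
    text: str
    blocks: List[dict] = field(default_factory=list)
    tables: List[dict] = field(default_factory=list)
    image_refs: List[str] = field(default_factory=list)


@runtime_checkable
class Parser(Protocol):
    """Assumed shape of the protocol: a single parse() method."""

    def parse(self, file: Path) -> ParsedDoc: ...


class PassthroughParser:
    """Toy parser: reads the file as UTF-8 text, unchanged."""

    def parse(self, file: Path) -> ParsedDoc:
        return ParsedDoc(text=file.read_text(encoding="utf-8"))


class UpperCaseParser:
    """Toy parser: reads the file as UTF-8 text and upper-cases it."""

    def parse(self, file: Path) -> ParsedDoc:
        return ParsedDoc(text=file.read_text(encoding="utf-8").upper())


def count_words(parser: Parser, file: Path) -> int:
    """A 'downstream stage' that only sees ParsedDoc, never the parser type."""
    return len(parser.parse(file).text.split())
```

Because `count_words` depends only on the protocol, either toy parser can be passed in without changing it, which mirrors how the pipeline treats real backends.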
| Parser | Strengths | Best for | Local? | Extra |
|---|---|---|---|---|
| Docling (default) | High-fidelity layout, reading order, tables, and figures from PDF/Office docs; richest structured output of the local options; permissive (Apache-friendly) license. | The default, offline, document-heavy knowledge spaces. | Yes | indx[docling] |
| Unstructured | Very broad file-type coverage and a mature partitioning ecosystem. | Heterogeneous corpora with many odd/long-tail formats. | Yes | indx[unstructured] |
| LlamaParse | Excellent on complex/messy PDFs via a hosted service. | Hard documents where cloud quality is worth it. | No (cloud) | indx[llamaparse] |
| MarkItDown | Lightweight, fast Markdown conversion with minimal dependencies. | Quick/simple conversions and the lightest local install. | Yes | indx[markitdown] |
| plaintext | Zero-dependency fallback that ships in core; reads text as-is. | Air-gapped runs, plain-text/Markdown corpora, CI smoke tests. | Yes | none (in core) |
How to select a parser
The parser slot is resolved with the same precedence as every other component:

    explicit code argument / use() > CLI flag > indx.toml > documented default

On the CLI
Use --parser to override per run:

    # Default (Docling): requires indx[docling]
    indx ./docs --out ./ai-ready

    # Lightest local parser
    indx ./docs --out ./ai-ready --parser markitdown

    # Broad format coverage
    indx ./docs --out ./ai-ready --parser unstructured

In indx.toml
The parser lives under the [parser] section, keyed engine:

    [parser]
    engine = "docling"  # docling | unstructured | llamaparse | markitdown | plaintext

See the full configuration reference for the complete key table and precedence rules.
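
The precedence chain (code argument, then CLI flag, then indx.toml, then the default) can be sketched as a small Python function. The function name and its parameters are hypothetical and only illustrate the ordering; indx's own resolver may be structured differently.

```python
from typing import Optional

DEFAULT_PARSER = "docling"  # the documented default


def resolve_parser(
    code_arg: Optional[str] = None,
    cli_flag: Optional[str] = None,
    toml_value: Optional[str] = None,
) -> str:
    """Return the first value that was set, checked from highest to lowest precedence."""
    for candidate in (code_arg, cli_flag, toml_value):
        if candidate is not None:
            return candidate
    return DEFAULT_PARSER
```

For example, a `--parser markitdown` flag wins over `engine = "unstructured"` in indx.toml, and an explicit code argument wins over both.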
As a Parser object (SDK)
For a fully custom or pre-configured parser, pass an instance to the DirectoryPipeline, either at construction or via use():

    from indx import DirectoryPipeline

    # By name
    pipeline = DirectoryPipeline(parser="markitdown")

    # By object (anything satisfying the Parser protocol)
    from my_pkg import MyMarkdownParser
    pipeline = DirectoryPipeline().use(parser=MyMarkdownParser())

    space = pipeline.run("./docs", "./ai-ready")

Writing your own parser is covered in Bring-your-own components.
Install the matching extra
Parsers are optional extras so the core install stays small. Each backend pulls its own heavy dependencies:
| Install | Enables |
|---|---|
| pip install indx[docling] | DoclingParser (default) |
| pip install indx[unstructured] | UnstructuredParser |
| pip install indx[llamaparse] | LlamaParseParser |
| pip install indx[markitdown] | MarkItDownParser |
| pip install indx[defaults] | The full local-first stack (docling + ollama + bge + qdrant) |
The plaintext parser needs no extra — it is always available.
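
Since each backend arrives through its own extra, it can be handy to check which ones are importable before choosing a parser. The sketch below uses only the standard library. The mapping from parser name to top-level module is an assumption based on the upstream package names, and both helper functions are hypothetical, not part of indx.

```python
from importlib.util import find_spec
from typing import List

# Assumed mapping: indx parser name -> top-level module installed by its extra.
BACKEND_MODULES = {
    "docling": "docling",
    "unstructured": "unstructured",
    "llamaparse": "llama_parse",
    "markitdown": "markitdown",
}


def is_available(parser_name: str) -> bool:
    """True if the backend for this parser name looks importable."""
    if parser_name == "plaintext":
        return True  # ships in core, needs no extra
    module = BACKEND_MODULES.get(parser_name)
    return module is not None and find_spec(module) is not None


def first_available(preferred: List[str]) -> str:
    """Return the first installed parser from the list, else the plaintext fallback."""
    for name in preferred:
        if is_available(name):
            return name
    return "plaintext"
```

Calling `first_available(["docling", "markitdown"])` would then prefer Docling when its extra is installed and degrade to the in-core plaintext parser otherwise.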
Choosing against the four constraints
indx is designed around four product principles. Each parser trades against them differently:
Local-first / air-gapped
Docling, Unstructured, MarkItDown, and plaintext all run fully local with no network calls. LlamaParse is cloud-only: it sends documents to a hosted service and needs an API key, which breaks the local profile. Reach for it only when cloud quality on genuinely hard PDFs is worth crossing that line; see Local & air-gapped.
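
A project that must stay air-gapped could enforce the split described above with a small guard before a pipeline is built. This is a sketch only: the function name and the `allow_cloud` switch are hypothetical, and the two sets simply restate the "Local?" column of the table.

```python
LOCAL_PARSERS = frozenset({"docling", "unstructured", "markitdown", "plaintext"})
CLOUD_PARSERS = frozenset({"llamaparse"})


def check_local_profile(parser_name: str, allow_cloud: bool = False) -> str:
    """Return parser_name if it fits the profile, otherwise raise ValueError."""
    if parser_name in LOCAL_PARSERS:
        return parser_name
    if parser_name in CLOUD_PARSERS:
        if allow_cloud:
            return parser_name
        raise ValueError(
            f"{parser_name!r} sends documents to a hosted service; "
            "pass allow_cloud=True to opt in"
        )
    raise ValueError(f"unknown parser: {parser_name!r}")
```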
Light core
pip install indx carries no parser beyond the in-core plaintext fallback. Among the optional parsers, MarkItDown is the lightest (minimal deps, fast Markdown conversion) and is the recommended choice when install size matters most. Docling is the heaviest local option (on some configurations it pulls models/Torch), which is exactly why it ships as an extra rather than in core.
No lock-in
Every parser sits behind the same Parser protocol, and a third party can publish a new one (advertised via the indx.parsers entry point) that works by name with no fork of indx. Swapping parsers is always a config change, never a rewrite. See authoring a plugin.
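
Entry-point discovery of this kind is a standard Python packaging mechanism, and a lookup over the indx.parsers group can be sketched with `importlib.metadata`. The helper name is hypothetical, and what each entry point resolves to (a class, a factory, or something else) is an assumption; indx's own loader may differ.

```python
from importlib.metadata import entry_points
from typing import Dict


def discover_parsers(group: str = "indx.parsers") -> Dict[str, object]:
    """Map each advertised parser name to its (not yet loaded) entry point."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points are filtered with select().
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


# A third-party package would typically advertise its parser in pyproject.toml
# under a table such as [project.entry-points."indx.parsers"]; calling
# ep.load() on a discovered entry then imports the referenced object.
```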
Output quality for downstream stages
Chunk, Relate, and Enrich are only as good as the structure the parser preserves. Docling produces the richest structured output of the local options (reading order, tables, figures), so it generally yields the best chunking and relating. MarkItDown is local and light but loses structure Docling keeps; plaintext keeps none at all.
Quick decision guide
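One way to condense the "Best for" column and the constraints above into a rule of thumb is a short function like the sketch below. The function, its flags, and the order in which they are checked are assumptions made for illustration, not logic that indx ships.

```python
def suggest_parser(
    *,
    plain_text_only: bool = False,
    hard_pdfs: bool = False,
    allow_cloud: bool = False,
    long_tail_formats: bool = False,
    minimal_install: bool = False,
) -> str:
    """Suggest a parser name from a few yes/no facts about the corpus and constraints."""
    if plain_text_only:
        return "plaintext"  # zero-dependency fallback in core
    if hard_pdfs and allow_cloud:
        return "llamaparse"  # hosted service, only when cloud use is acceptable
    if long_tail_formats:
        return "unstructured"  # broad file-type coverage
    if minimal_install:
        return "markitdown"  # lightest optional parser
    return "docling"  # the default all-round local choice
```

With no flags set the suggestion is the default, Docling; a hard-PDF corpus without cloud permission also falls through to a local option.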
Related pages
- Component protocols: the Parser contract in full.
- Bring-your-own components: write and plug in your own parser.
- Extras reference: the complete optional-dependency matrix.
- Choosing a store and choosing an embedder: the same decision framework for the other slots.