Choosing a Parser

The Parser slot is stage 02 of the pipeline: it turns each file into a normalized ParsedDoc — text, structural blocks, tables, and image refs — that every later stage consumes. indx composes parsers; it does not replace them, so picking the right one is mostly about matching your corpus and your constraints (local-first, light core, no lock-in) to the strengths of each backend.

The default is Docling, the best all-round local choice. This guide explains the alternatives, how to select one, and which extra to install.

Every parser implements the same Parser protocol (parse(file) -> ParsedDoc), so the pipeline never knows or cares which one is active. That means you can switch freely without touching anything downstream.
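As a rough illustration of what sharing one protocol means in practice, the sketch below models it with `typing.Protocol`. Only the `parse(file) -> ParsedDoc` shape comes from this guide. The `ParsedDoc` field names, the dataclass layout, and the `ParagraphParser` class are assumptions invented for the example, not indx's actual definitions.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Protocol, runtime_checkable


@dataclass
class ParsedDoc:
    # Hypothetical fields mirroring "text, structural blocks, tables, image refs".
    text: str
    blocks: list[str] = field(default_factory=list)
    tables: list[list[list[str]]] = field(default_factory=list)
    image_refs: list[str] = field(default_factory=list)


@runtime_checkable
class Parser(Protocol):
    # The one method every backend is expected to provide.
    def parse(self, file: Path) -> ParsedDoc: ...


class ParagraphParser:
    """Toy parser: reads UTF-8 text and treats blank-line-separated runs as blocks."""

    def parse(self, file: Path) -> ParsedDoc:
        text = Path(file).read_text(encoding="utf-8")
        blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
        return ParsedDoc(text=text, blocks=blocks)
```

Because the protocol is structural, `ParagraphParser` never has to inherit from `Parser`; having a compatible `parse` method is enough for downstream code to treat it like any other backend.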

| Parser | Strengths | Best for | Local? | Extra |
| --- | --- | --- | --- | --- |
| Docling (default) | High-fidelity layout, reading order, tables, and figures from PDF/Office docs; richest structured output of the local options; permissive (Apache-friendly) license. | The default, offline, document-heavy knowledge spaces. | Yes | indx[docling] |
| Unstructured | Very broad file-type coverage and a mature partitioning ecosystem. | Heterogeneous corpora with many odd/long-tail formats. | Yes | indx[unstructured] |
| LlamaParse | Excellent on complex/messy PDFs via a hosted service. | Hard documents where cloud quality is worth it. | No (cloud) | indx[llamaparse] |
| MarkItDown | Lightweight, fast Markdown conversion with minimal dependencies. | Quick/simple conversions and the lightest local install. | Yes | indx[markitdown] |
| plaintext | Zero-dependency fallback that ships in core; reads text as-is. | Air-gapped runs, plain-text/Markdown corpora, CI smoke tests. | Yes | none (in core) |

The parser slot is resolved with the same precedence as every other component:

explicit code argument / use() > CLI flag > indx.toml > documented default
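The precedence chain can be read as "first value that is set wins". The function below is a minimal sketch of that rule, not indx's resolver: the name `resolve_parser` and its parameters are hypothetical, and only the ordering and the Docling default are taken from this guide.

```python
from typing import Optional

DEFAULT_PARSER = "docling"  # the documented default named in this guide


def resolve_parser(
    code_arg: Optional[str] = None,
    cli_flag: Optional[str] = None,
    toml_value: Optional[str] = None,
) -> str:
    """Return the first value that is set, checked in precedence order."""
    for candidate in (code_arg, cli_flag, toml_value):
        if candidate is not None:
            return candidate
    return DEFAULT_PARSER
```

For instance, `resolve_parser(cli_flag="markitdown", toml_value="unstructured")` yields `"markitdown"`, because the CLI flag outranks indx.toml.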

Use --parser to override per run:

# Default (Docling) — requires indx[docling]
indx ./docs --out ./ai-ready
# Lightest local parser
indx ./docs --out ./ai-ready --parser markitdown
# Broad format coverage
indx ./docs --out ./ai-ready --parser unstructured

In indx.toml, the parser is set in the [parser] section under the engine key:

[parser]
engine = "docling" # docling | unstructured | llamaparse | markitdown | plaintext

See the full configuration reference for the complete key table and precedence rules.

To select a parser in code, pass its name to DirectoryPipeline at construction. For a fully custom or pre-configured parser, pass an instance instead, either at construction or via use():

from indx import DirectoryPipeline
from my_pkg import MyMarkdownParser  # your own class; my_pkg is a placeholder

# By name
pipeline = DirectoryPipeline(parser="markitdown")

# By object (anything satisfying the Parser protocol)
pipeline = DirectoryPipeline().use(parser=MyMarkdownParser())

# Run whichever pipeline was built last
space = pipeline.run("./docs", "./ai-ready")

Writing your own parser is covered in Bring-your-own components.

Parsers are optional extras so the core install stays small. Each backend pulls its own heavy dependencies:

| Install | Enables |
| --- | --- |
| pip install indx[docling] | DoclingParser (default) |
| pip install indx[unstructured] | UnstructuredParser |
| pip install indx[llamaparse] | LlamaParseParser |
| pip install indx[markitdown] | MarkItDownParser |
| pip install indx[defaults] | The full local-first stack (docling + ollama + bge + qdrant) |

The plaintext parser needs no extra — it is always available.
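Since the extras are optional, a script may want to know which local backends are actually importable before choosing a parser name. The helper below is a hypothetical convenience, not part of indx: it probes for the upstream packages' import names with `importlib.util.find_spec` and falls back to `plaintext`. The preference order is this example's own choice, and whether indx performs any fallback itself is not stated in this guide.

```python
from importlib.util import find_spec
from typing import Callable

# (indx parser name, upstream import name); order is this example's preference.
CANDIDATES = [
    ("docling", "docling"),
    ("unstructured", "unstructured"),
    ("markitdown", "markitdown"),
]


def _importable(module: str) -> bool:
    # find_spec returns None when a top-level module cannot be found.
    return find_spec(module) is not None


def pick_local_parser(is_installed: Callable[[str], bool] = _importable) -> str:
    """Return the first local parser whose backend is importable, else 'plaintext'."""
    for engine, module in CANDIDATES:
        if is_installed(module):
            return engine
    return "plaintext"
```

The result can be passed wherever a parser name is accepted, for example as the `--parser` value or the `parser=` argument.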

indx is designed around four product principles. Each parser trades against them differently:

Docling, Unstructured, MarkItDown, and plaintext all run fully local with no network calls. LlamaParse is cloud-only — it sends documents to a hosted service and needs an API key, which breaks the local profile. Reach for it only when cloud quality on genuinely hard PDFs is worth crossing that line; see Local & air-gapped.

A bare pip install indx installs no parser beyond the in-core plaintext fallback. Among the optional parsers, MarkItDown is the lightest (minimal deps, fast Markdown conversion) and is the recommended choice when install size matters most. Docling is the heaviest local option: on some configurations it pulls in models and Torch, which is exactly why it ships as an extra rather than in core.

Every parser sits behind the same Parser protocol, and a third party can publish a new one (advertised via the indx.parsers entry point) that works by name with no fork of indx. Swapping parsers is always a config change, never a rewrite. See authoring a plugin.

Chunk, Relate, and Enrich are only as good as the structure the parser preserves. Docling produces the richest structured output of the local options (reading order, tables, figures), so it generally yields the best chunking and relating. MarkItDown is local and light but loses structure Docling keeps; plaintext keeps none at all.