Choosing a Parser

The Parser slot is stage 02 of the pipeline: it turns each file into a normalized ParsedDoc — text, structural blocks, tables, and image refs — that every later stage consumes. indx composes parsers; it does not replace them, so picking the right one is mostly about matching your corpus and your constraints (local-first, light core, no lock-in) to the strengths of each backend.

The default is Docling, the best all-round local choice. This guide explains the alternatives, how to select one, and which extra to install.

The options at a glance

Every parser implements the same Parser protocol (parse(file) -> ParsedDoc), so the pipeline never knows or cares which one is active. That means you can switch freely without touching anything downstream.

Parser	Strengths	Best for	Local?	Extra
Docling (default)	High-fidelity layout, reading order, tables, and figures from PDF/Office docs; richest structured output of the local options; permissive MIT license.	Offline, document-heavy knowledge spaces; the default choice.	Yes	`indx[docling]`
Unstructured	Very broad file-type coverage and a mature partitioning ecosystem.	Heterogeneous corpora with many odd/long-tail formats.	Yes	`indx[unstructured]`
LlamaParse	Excellent on complex/messy PDFs via a hosted service.	Hard documents where cloud quality is worth it.	No (cloud)	`indx[llamaparse]`
MarkItDown	Lightweight, fast Markdown conversion with minimal dependencies.	Quick/simple conversions and the lightest local install.	Yes	`indx[markitdown]`
plaintext	Zero-dependency fallback that ships in core; reads text as-is.	Air-gapped runs, plain-text/Markdown corpora, CI smoke tests.	Yes	none (in core)
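
To make the shared contract concrete, here is a minimal sketch of what a Parser protocol and a conforming parser could look like. The ParsedDoc field names and the PlainTextParser class are assumptions for illustration, based only on the description above (text, structural blocks, tables, image refs); the real definitions live in indx and may differ.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Protocol, runtime_checkable


@dataclass
class ParsedDoc:
    """Hypothetical stand-in for indx's ParsedDoc; field names are assumed."""
    text: str
    blocks: list[str] = field(default_factory=list)
    tables: list[list[list[str]]] = field(default_factory=list)
    image_refs: list[str] = field(default_factory=list)


@runtime_checkable
class Parser(Protocol):
    """Structural contract: anything with a matching parse() qualifies."""

    def parse(self, file: Path) -> ParsedDoc: ...


class PlainTextParser:
    """Illustrative parser: reads the file as-is, one block per paragraph."""

    def parse(self, file: Path) -> ParsedDoc:
        text = Path(file).read_text(encoding="utf-8")
        blocks = [p.strip() for p in text.split("\n\n") if p.strip()]
        return ParsedDoc(text=text, blocks=blocks)
```

Because the protocol is structural, PlainTextParser never inherits from Parser; having a compatible parse() method is enough, which is what lets backends be swapped without touching downstream stages.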

How to select a parser

The parser slot is resolved with the same precedence as every other component:

explicit code argument / use()   >   CLI flag   >   indx.toml   >   documented default
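
The precedence chain above amounts to "first value that is set wins". A minimal sketch, assuming each source yields either a parser name or None; the function and parameter names are hypothetical and not part of indx:

```python
from typing import Optional

DEFAULT_PARSER = "docling"  # the documented default


def resolve_parser(
    code_arg: Optional[str] = None,
    cli_flag: Optional[str] = None,
    toml_value: Optional[str] = None,
) -> str:
    """Return the first value that is set, from highest to lowest precedence."""
    for candidate in (code_arg, cli_flag, toml_value):
        if candidate is not None:
            return candidate
    return DEFAULT_PARSER
```

For example, resolve_parser(cli_flag="markitdown", toml_value="unstructured") picks the CLI flag, and calling it with no arguments falls back to the default.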

On the CLI

Use --parser to override per run:

# Default (Docling) — requires indx[docling]
indx ./docs --out ./ai-ready

# Lightest local parser
indx ./docs --out ./ai-ready --parser markitdown

# Broad format coverage
indx ./docs --out ./ai-ready --parser unstructured

In `indx.toml`

The parser is set by the engine key in the [parser] section:

[parser]
engine = "docling"   # docling | unstructured | llamaparse | markitdown | plaintext
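
If you want to check a config file outside indx, the key can be read with Python's TOML parser. This is only a sketch: the helper name and the validation against the five built-in names are assumptions, and a real installation may also accept plugin-provided names.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:  # older interpreters: pip install tomli
    import tomli as tomllib

# Built-in engine names listed in this guide; plugin names are not covered here.
KNOWN_ENGINES = {"docling", "unstructured", "llamaparse", "markitdown", "plaintext"}


def engine_from_toml(toml_text: str, default: str = "docling") -> str:
    """Read [parser].engine from TOML text, falling back to the default."""
    config = tomllib.loads(toml_text)
    engine = config.get("parser", {}).get("engine", default)
    if engine not in KNOWN_ENGINES:
        raise ValueError(f"unknown parser engine: {engine!r}")
    return engine
```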

See the full configuration reference for the complete key table and precedence rules.

As a `Parser` object (SDK)

In the SDK, pass the parser to DirectoryPipeline either by name or as an object satisfying the Parser protocol (for a fully custom or pre-configured parser), at construction or via use():

from indx import DirectoryPipeline
from my_pkg import MyMarkdownParser  # placeholder for your own package

# By name
pipeline = DirectoryPipeline(parser="markitdown")

# By object (anything satisfying the Parser protocol); replaces the pipeline above
pipeline = DirectoryPipeline().use(parser=MyMarkdownParser())

space = pipeline.run("./docs", "./ai-ready")

Writing your own parser is covered in Bring-your-own components.

Install the matching extra

Parsers are optional extras so the core install stays small. Each backend pulls in only its own dependencies:

Install	Enables
`pip install indx[docling]`	DoclingParser (default)
`pip install indx[unstructured]`	UnstructuredParser
`pip install indx[llamaparse]`	LlamaParseParser
`pip install indx[markitdown]`	MarkItDownParser
`pip install indx[defaults]`	The full local-first stack (docling + ollama + bge + qdrant)

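The plaintext parser needs no extra; it is always available. In shells that treat square brackets as glob patterns (zsh, for example), quote the requirement: pip install "indx[docling]".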

If you select a parser whose extra is not installed, indx raises a single, clear error naming the exact command to run — for example:

MissingDependencyError: parser 'docling' requires: pip install indx[docling]

The error is raised only when that slot is actually selected, so a backend you have not installed and do not use never affects an unrelated run. For the light path, the message points you at pip install indx[markitdown] instead.
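
One way to get this "check only what is selected" behavior is to probe for the backend's module at selection time rather than at import time. The sketch below is an assumption about the general technique, not indx's actual implementation; the MissingDependencyError class and the REQUIREMENTS mapping are hypothetical stand-ins.

```python
import importlib.util


class MissingDependencyError(ImportError):
    """Hypothetical stand-in for the error indx raises."""


# Assumed mapping: parser name -> (importable module, extra to install).
REQUIREMENTS = {
    "docling": ("docling", "docling"),
    "markitdown": ("markitdown", "markitdown"),
}


def ensure_available(parser: str) -> None:
    """Check only the selected parser, so unused backends are never probed."""
    if parser not in REQUIREMENTS:  # e.g. plaintext ships in core
        return
    module, extra = REQUIREMENTS[parser]
    if importlib.util.find_spec(module) is None:
        raise MissingDependencyError(
            f"parser '{parser}' requires: pip install indx[{extra}]"
        )
```

Calling ensure_available("plaintext") always succeeds, while a parser whose module cannot be found fails with a message naming the install command.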

Choosing against the four constraints

indx is designed around four product principles, and each parser trades them off differently:

Local-first / air-gapped

Docling, Unstructured, MarkItDown, and plaintext all run fully locally with no network calls. LlamaParse is cloud-only: it sends documents to a hosted service and needs an API key, which breaks the local profile. Reach for it only when cloud quality on genuinely hard PDFs is worth crossing that line; see Local & air-gapped.

Light core

A bare pip install indx carries no parser beyond the in-core plaintext fallback. Among the optional parsers, MarkItDown is the lightest (minimal dependencies, fast Markdown conversion) and is the recommended choice when install size matters most. Docling is the heaviest local option (in some configurations it pulls in models and PyTorch), which is exactly why it ships as an extra rather than in core.

No lock-in

Every parser sits behind the same Parser protocol, and a third party can publish a new one (advertised via the indx.parsers entry point) that works by name with no fork of indx. Swapping parsers is always a config change, never a rewrite. See authoring a plugin.

Output quality for downstream stages

Chunk, Relate, and Enrich are only as good as the structure the parser preserves. Docling produces the richest structured output of the local options (reading order, tables, figures), so it generally yields the best chunking and relating. MarkItDown is local and light but loses structure Docling keeps; plaintext keeps none at all.

Quick decision guide
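
The guidance in the sections above can be condensed into a small helper. This is an illustrative sketch only: the function, its parameters, and the order in which the checks are applied are assumptions, not part of indx.

```python
def suggest_parser(
    cloud_allowed: bool = False,
    hard_pdfs: bool = False,
    plain_text_only: bool = False,
    many_odd_formats: bool = False,
    smallest_install: bool = False,
) -> str:
    """Suggest a parser name from coarse corpus and deployment traits."""
    if plain_text_only:
        return "plaintext"      # zero-dependency, ships in core
    if hard_pdfs and cloud_allowed:
        return "llamaparse"     # cloud quality, breaks the local profile
    if smallest_install:
        return "markitdown"     # lightest optional parser
    if many_odd_formats:
        return "unstructured"   # broadest file-type coverage
    return "docling"            # documented default
```

Note that hard PDFs alone do not select LlamaParse: without cloud_allowed the helper stays on a local parser, matching the local-first principle.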

See also

Component protocols: the Parser contract in full.
Bring-your-own components: write and plug in your own parser.
Extras reference: the complete optional-dependency matrix.
Choosing a store and choosing an embedder: the same decision framework for the other slots.