02 · Parse
Parse is the second stage of the pipeline. It takes the directory graph that Walk discovered and runs every file through the configured Parser slot, producing one normalized ParsedDoc per document. This is where indx earns its tagline: it composes best-in-class parsers behind a single typed interface — it never reimplements PDF, Office, or layout extraction itself.
What this stage does
For each file in the directory graph, the Parse stage:
- Resolves a FileRef: the path, a bytes accessor, the detected MIME/type, and the folder lineage produced by stage 01.
- Calls Parser.parse(file) on the configured parser (default: docling).
- Stores the returned ParsedDoc in the shared context under ctx.parsed, keyed by doc_id.
Like every stage, Parse obeys the Stage contract — run(ctx: SpaceContext) -> SpaceContext — mutating and returning the same SpaceContext it received. Downstream, the Chunk stage reads ctx.parsed to split content with structure intact.
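The loop described above can be sketched in a few lines. This is a minimal illustration, not indx's implementation: FileRef, ParsedDoc, and SpaceContext are simplified stand-ins defined inline, the `files` attribute stands in for the directory graph, and carrying `doc_id` on the FileRef is an assumption made only so the example can key its results.

```python
from dataclasses import dataclass, field
from typing import Any


# Hypothetical stand-ins for indx types, so the sketch runs on its own.
@dataclass
class FileRef:
    doc_id: str  # assumption: the real FileRef may derive ids differently
    path: str


@dataclass
class ParsedDoc:
    path: str
    text: str


@dataclass
class SpaceContext:
    files: list = field(default_factory=list)    # stand-in for the directory graph
    parser: Any = None                           # the configured Parser slot
    parsed: dict = field(default_factory=dict)   # doc_id -> ParsedDoc


class UpperCaseParser:
    """Toy parser: pretends a file's text is its path in upper case."""

    def parse(self, file: FileRef) -> ParsedDoc:
        return ParsedDoc(path=file.path, text=file.path.upper())


class ParseStage:
    def run(self, ctx: SpaceContext) -> SpaceContext:
        for file in ctx.files:
            # One ParsedDoc per file, stored under its doc_id.
            ctx.parsed[file.doc_id] = ctx.parser.parse(file)
        return ctx  # the same object that was passed in, now mutated


ctx = SpaceContext(
    files=[FileRef("d1", "a.txt"), FileRef("d2", "b.txt")],
    parser=UpperCaseParser(),
)
out = ParseStage().run(ctx)
```

Returning the same context object, rather than a copy, is what lets the next stage read ctx.parsed without any extra wiring.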
    01 Walk              02 Parse               03 Chunk
    dir graph   ──────▶  Parser.parse  ──────▶  split + lineage
    + FileRefs           → ParsedDoc            (uses blocks/tables)

The Parser protocol
A parser is any object that structurally satisfies the Parser protocol; no base class or import of indx internals is required.
    from typing import Protocol, runtime_checkable
    from indx import ParsedDoc

    @runtime_checkable
    class Parser(Protocol):
        """Converts a single file into a ParsedDoc. Default: docling."""
        def parse(self, file: "FileRef") -> ParsedDoc: ...

FileRef carries the resolved path, a bytes accessor, the detected MIME/type, and the folder lineage produced by stage 01. See the protocols reference for the full contract and the custom components guide for a bring-your-own-parser walkthrough.
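To make the structural-typing point concrete, here is a small self-contained sketch. The ParsedDoc class is a one-field stand-in, the Parser protocol is re-declared locally so the example does not need indx installed, and the `read_text()` method on the file argument is an assumption for illustration only (the real FileRef exposes a bytes accessor whose exact name is not shown here).

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class ParsedDoc:  # hypothetical stand-in for indx.ParsedDoc
    text: str


@runtime_checkable
class Parser(Protocol):  # local copy of the protocol, for a runnable example
    def parse(self, file) -> ParsedDoc: ...


class HeadingStripParser:
    """Does not inherit from Parser; it only provides a matching parse()."""

    def parse(self, file) -> ParsedDoc:
        # Assumption: 'file' exposes read_text(); the real FileRef may differ.
        raw = file.read_text()
        return ParsedDoc(text=raw.replace("#", "").strip())


class NotAParser:
    def convert(self, file): ...


print(isinstance(HeadingStripParser(), Parser))  # True
print(isinstance(NotAParser(), Parser))          # False
```

Note that a runtime isinstance check against a runtime_checkable protocol only verifies that a parse attribute exists; it does not validate the signature or the return type, so a static type checker remains the stronger guard.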
The ParsedDoc it returns
ParsedDoc is the raw, pre-chunking output of a parser for one source file. It is a Pydantic v2 model with these fields:
| Field | Type | Description |
|---|---|---|
| source | Source | Provenance: path, folder, and detected type. |
| text | str | Normalized full-text rendering of the document. |
| blocks | list[dict[str, Any]] | Structural blocks: headings, paragraphs, list items. |
| tables | list[dict[str, Any]] | Extracted tables (rows / cells / markdown). |
| images | list[dict[str, Any]] | Image refs and captions, used later by VLM enrichment. |
| metadata | dict[str, Any] | Parser-supplied raw metadata (title, author, page count, …). |
Only source and text are required; blocks, tables, images, and metadata default to empty and are populated to whatever depth the active parser supports. A minimal parser can return just normalized text; a rich parser like Docling fills in layout, reading order, tables, and figures so that downstream stages get high-quality structure to work with.
    from indx import ParsedDoc, Source

    ParsedDoc(
        source=Source(path="policies/data/retention.pdf", folder="policies/data", type="policy"),
        text="Enterprise data is retained for 90 days…",
        blocks=[{"kind": "heading", "text": "Retention"}],
        tables=[],
        images=[],
        metadata={"page_count": 4, "title": "Data Retention Policy"},
    )

Available parsers
Each non-fallback parser ships as an optional extra so the core install stays light. Select one with --parser, the [parser] engine key in indx.toml, or DirectoryPipeline(parser=...).
| Name | Class | Install | Best for |
|---|---|---|---|
| docling (default) | DoclingParser | pip install indx[docling] | High-fidelity layout, reading order, tables, and figures from PDFs/Office docs; fully local. |
| unstructured | UnstructuredParser | pip install indx[unstructured] | Heterogeneous corpora with many odd file formats. |
| llamaparse | LlamaParseParser | pip install indx[llamaparse] | Hard, messy PDFs via a hosted service (cloud). |
| markitdown | MarkItDownParser | pip install indx[markitdown] | Lightweight, fast Markdown conversion; the lightest local install. |
| plaintext | PlainTextParser | ships in core | Zero-dependency fallback so a run works offline. |
See choosing a parser for a deeper comparison and decision guide.
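As a rough illustration of how a parser name from the table might be resolved, the sketch below maps each name to its class name and extra. The names and extras are copied from the table; the `PARSERS` dictionary and the `resolve_parser` function are hypothetical and are not part of indx's public API.

```python
from typing import Optional

# Names, class names, and extras copied from the table above.
# The lookup itself is a hypothetical illustration.
PARSERS = {
    "docling": ("DoclingParser", "indx[docling]"),
    "unstructured": ("UnstructuredParser", "indx[unstructured]"),
    "llamaparse": ("LlamaParseParser", "indx[llamaparse]"),
    "markitdown": ("MarkItDownParser", "indx[markitdown]"),
    "plaintext": ("PlainTextParser", None),  # ships in core, no extra needed
}


def resolve_parser(name: str = "docling") -> tuple[str, Optional[str]]:
    """Return (class name, pip extra) for a parser name, or raise on unknown names."""
    try:
        return PARSERS[name]
    except KeyError:
        known = ", ".join(sorted(PARSERS))
        raise ValueError(f"unknown parser {name!r}; expected one of: {known}") from None


class_name, extra = resolve_parser("markitdown")
print(class_name, extra)  # MarkItDownParser indx[markitdown]
```

Failing loudly on an unknown name matches the behavior described in the error model, where an unresolvable parser name always aborts the run.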
Concurrency
Parse is embarrassingly parallel across files. A worker pool of size --jobs (default: the CPU count) runs Parser.parse concurrently; results are merged back into ctx.parsed keyed by doc_id.
- Blocking or native parsers (Docling and Unstructured both wrap native code) run in a bounded thread pool that keeps the run responsive.
- GIL-bound, CPU-heavy parsers may run in a process pool to sidestep the GIL, at the cost of pickling ParsedDocs across process boundaries.
Concurrency never affects output: parallel results are re-sorted into deterministic order (folder lineage, then path) before any chunk/document ids are assigned, so reruns over unchanged input yield identical ids. See performance and reproducibility for tuning and guarantees.
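The determinism guarantee can be illustrated with a thread pool whose results arrive in arbitrary order and are then re-sorted before ids are handed out. This is a sketch under stated assumptions: `toy_parse`, the sequential `doc-0001` id scheme, and the use of a path's parent directories as its "folder lineage" are all hypothetical and chosen only to show why sorting must precede id assignment.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import PurePosixPath


def toy_parse(path: str) -> str:
    """Stand-in for Parser.parse: returns fake text for a path."""
    return f"text of {path}"


def sort_key(path: str) -> tuple:
    # Folder lineage (parent directory parts) first, then the full path.
    return (PurePosixPath(path).parent.parts, path)


def parse_all(paths: list, jobs: int = 4) -> dict:
    done = []
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = {pool.submit(toy_parse, p): p for p in paths}
        for fut in as_completed(futures):  # completion order is not deterministic
            done.append((futures[fut], fut.result()))

    # Re-sort into a stable order before any ids are assigned.
    done.sort(key=lambda item: sort_key(item[0]))
    # Hypothetical sequential ids; they depend only on the sorted order.
    return {f"doc-{i:04d}": text for i, (_, text) in enumerate(done, start=1)}


paths = ["b/x.md", "a/z.md", "a/y.md", "top.md"]
first = parse_all(paths, jobs=4)
second = parse_all(list(reversed(paths)), jobs=2)
print(first == second)  # True
```

Because ids are derived only after sorting, neither the worker count nor the order in which files were submitted changes the result.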
    # Parse (and embed) with 8 workers
    indx ./docs --out ./ai-ready --jobs 8

Error model
Parse uses per-item skip as its default failure mode. When a single file fails to parse, indx does not abort the whole run:
- The file is skipped.
- A StageError(kind="skip") is appended to ctx.errors (recording the stage, the offending item path, and a message).
- The pipeline continues with the remaining files.
Skipped items are surfaced on the resulting space under space.metadata["errors"], and the build summary reports the count:
    02 parse 127 ok, 1 skipped

Passing --strict (or strict=True in the SDK) promotes every skip to a fatal error: the first bad file aborts the run with a PipelineError, and the CLI exits with code 1. Use strict mode in CI or when a corrupt input must never silently disappear. Genuinely fatal conditions, such as an unresolvable parser name or a missing extra, always abort regardless of strict mode. See errors and exit codes for the full table.
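The difference between per-item skip and strict mode can be sketched as a single try/except around each file. StageError and PipelineError here are simplified stand-ins whose field names follow the prose above; `parse_files` and `flaky_parse` are hypothetical helpers, not indx functions.

```python
from dataclasses import dataclass


@dataclass
class StageError:  # hypothetical stand-in; fields follow the description above
    stage: str
    item: str
    message: str
    kind: str = "skip"


class PipelineError(Exception):
    """Stand-in for the fatal error raised in strict mode."""


def parse_files(paths, parse, strict=False):
    parsed, errors = {}, []
    for path in paths:
        try:
            parsed[path] = parse(path)
        except Exception as exc:
            if strict:
                # Strict mode: the first bad file aborts the whole run.
                raise PipelineError(f"parse failed for {path}: {exc}") from exc
            # Default mode: record the skip and keep going.
            errors.append(StageError(stage="parse", item=path, message=str(exc)))
    return parsed, errors


def flaky_parse(path):
    if path.endswith(".bad"):
        raise ValueError("corrupt file")
    return f"text of {path}"


parsed, errors = parse_files(["a.md", "b.bad", "c.md"], flaky_parse)
print(f"02 parse {len(parsed)} ok, {len(errors)} skipped")  # 02 parse 2 ok, 1 skipped
```

Calling `parse_files(["b.bad"], flaky_parse, strict=True)` raises PipelineError instead of returning a skip record.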
Where it fits
| Reads from context | Writes to context |
|---|---|
| ctx.dir_graph (from Walk) | ctx.parsed: doc_id → ParsedDoc |
| ctx.config, ctx.parser | ctx.errors: per-item skips |
Next, the Chunk stage consumes each ParsedDoc and splits it into retrievable Chunks while preserving the structure that Parse captured.