02 · Parse

Parse is the second stage of the pipeline. It takes the directory graph that Walk discovered and runs every file through the configured Parser slot, producing one normalized ParsedDoc per document. This is where indx earns its tagline: it composes best-in-class parsers behind a single typed interface — it never reimplements PDF, Office, or layout extraction itself.

For each file in the directory graph, the Parse stage:

  1. Resolves a FileRef — the path, a bytes accessor, the detected MIME/type, and the folder lineage produced by stage 01.
  2. Calls Parser.parse(file) on the configured parser (default: docling).
  3. Stores the returned ParsedDoc in the shared context under ctx.parsed, keyed by doc_id.

Like every stage, Parse obeys the Stage contract — run(ctx: SpaceContext) -> SpaceContext — mutating and returning the same SpaceContext it received. Downstream, the Chunk stage reads ctx.parsed to split content with structure intact.
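
The three steps and the Stage contract above can be sketched as a single loop. This is a minimal illustration, not indx's implementation: the `FileRef`, `ParsedDoc`, and `SpaceContext` classes below are simplified stand-ins, their field names are assumptions, and using the file path as the `doc_id` is a placeholder choice.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class FileRef:
    """Stand-in for indx's FileRef; field names are assumptions."""
    path: str
    read_bytes: Callable[[], bytes]
    mime: str
    lineage: tuple[str, ...]


@dataclass
class ParsedDoc:
    """Stand-in carrying only the two required fields."""
    source: str
    text: str


@dataclass
class SpaceContext:
    """Stand-in for the shared context passed between stages."""
    dir_graph: list[FileRef] = field(default_factory=list)
    parsed: dict[str, ParsedDoc] = field(default_factory=dict)
    parser: Any = None


class Utf8Parser:
    """Toy parser: decodes the file's bytes as UTF-8."""

    def parse(self, file: FileRef) -> ParsedDoc:
        return ParsedDoc(source=file.path, text=file.read_bytes().decode("utf-8"))


class ParseStage:
    def run(self, ctx: SpaceContext) -> SpaceContext:
        for file in ctx.dir_graph:        # 1. FileRefs discovered by Walk
            doc = ctx.parser.parse(file)  # 2. delegate to the configured parser
            ctx.parsed[file.path] = doc   # 3. store by doc_id (path here, an assumption)
        return ctx                        # the same object, mutated in place


ctx = SpaceContext(
    dir_graph=[FileRef("docs/a.txt", lambda: b"hello", "text/plain", ("docs",))],
    parser=Utf8Parser(),
)
out = ParseStage().run(ctx)
```

Returning the context that was passed in, rather than a copy, is what lets the next stage read `ctx.parsed` without any extra wiring.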

01 Walk            02 Parse             03 Chunk
dir graph ──────▶  Parser.parse ──────▶ split + lineage
+ FileRefs         → ParsedDoc          (uses blocks/tables)

A parser is any object that structurally satisfies the Parser protocol — no base class or import of indx internals required.

from typing import Protocol, runtime_checkable
from indx import ParsedDoc

@runtime_checkable
class Parser(Protocol):
    """Converts a single file into a ParsedDoc. Default: docling."""
    def parse(self, file: "FileRef") -> ParsedDoc: ...

FileRef carries the resolved path, a bytes accessor, the detected MIME/type, and the folder lineage produced by stage 01. See the protocols reference for the full contract and the custom components guide for a bring-your-own-parser walkthrough.
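
To illustrate what "structurally satisfies" means, here is a small self-contained sketch. It redeclares a local copy of the protocol and uses simplified stand-in `FileRef` and `ParsedDoc` classes (assumptions, not indx's real models) so that it runs without indx installed. `UppercaseParser` and `NotAParser` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class FileRef:
    """Stand-in: the real FileRef exposes more than this."""
    path: str
    data: bytes


@dataclass
class ParsedDoc:
    """Stand-in carrying only the two required fields."""
    source: str
    text: str


@runtime_checkable
class Parser(Protocol):
    def parse(self, file: FileRef) -> ParsedDoc: ...


class UppercaseParser:
    """Does not inherit from Parser; only the method shape matters."""

    def parse(self, file: FileRef) -> ParsedDoc:
        return ParsedDoc(source=file.path, text=file.data.decode("utf-8").upper())


class NotAParser:
    """Has no parse method, so it does not satisfy the protocol."""

    def convert(self, file: FileRef) -> None:
        return None


is_parser = isinstance(UppercaseParser(), Parser)
is_not_parser = isinstance(NotAParser(), Parser)
doc = UppercaseParser().parse(FileRef("notes/a.txt", b"hello"))
```

Note that an `isinstance` check against a `runtime_checkable` protocol only confirms that a `parse` attribute exists; it does not validate the signature or return type, so a static type checker is still the stronger guarantee.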

ParsedDoc is the raw, pre-chunking output of a parser for one source file. It is a Pydantic v2 model with these fields:

| Field | Type | Description |
| --- | --- | --- |
| source | Source | Provenance: path, folder, and detected type. |
| text | str | Normalized full-text rendering of the document. |
| blocks | list[dict[str, Any]] | Structural blocks: headings, paragraphs, list items. |
| tables | list[dict[str, Any]] | Extracted tables (rows / cells / markdown). |
| images | list[dict[str, Any]] | Image refs and captions, used later by VLM enrichment. |
| metadata | dict[str, Any] | Parser-supplied raw metadata (title, author, page count, …). |

Only source and text are required; blocks, tables, images, and metadata default to empty and are populated to whatever depth the active parser supports. A minimal parser can return just normalized text; a rich parser like Docling fills in layout, reading order, tables, and figures so that downstream stages get high-quality structure to work with.

from indx import ParsedDoc, Source

ParsedDoc(
    source=Source(
        path="policies/data/retention.pdf",
        folder="policies/data",
        type="policy",
    ),
    text="Enterprise data is retained for 90 days…",
    blocks=[{"kind": "heading", "text": "Retention"}],
    tables=[],
    images=[],
    metadata={"page_count": 4, "title": "Data Retention Policy"},
)

Each non-fallback parser ships as an optional extra so the core install stays light. Select one with --parser, the [parser] engine key in indx.toml, or DirectoryPipeline(parser=...).

| Name | Class | Install | Best for |
| --- | --- | --- | --- |
| docling (default) | DoclingParser | pip install indx[docling] | High-fidelity layout, reading order, tables, and figures from PDFs/Office docs; fully local. |
| unstructured | UnstructuredParser | pip install indx[unstructured] | Heterogeneous corpora with many odd file formats. |
| llamaparse | LlamaParseParser | pip install indx[llamaparse] | Hard, messy PDFs via a hosted service (cloud). |
| markitdown | MarkItDownParser | pip install indx[markitdown] | Lightweight, fast Markdown conversion; the lightest local install. |
| plaintext | PlainTextParser | ships in core | Zero-dependency fallback so a run works offline. |

See choosing a parser for a deeper comparison and decision guide.

Parse is embarrassingly parallel across files. A worker pool of size --jobs (default = CPU count) runs Parser.parse concurrently; results are merged back into ctx.parsed keyed by doc_id.

  • Blocking or native parsers (Docling and Unstructured both wrap native code) run in a bounded thread pool, which keeps the run responsive.
  • GIL-bound, CPU-heavy parsers may run in a process pool to sidestep the GIL, at the cost of pickling ParsedDocs across process boundaries.

Concurrency never affects output: parallel results are re-sorted into deterministic order (folder lineage, then path) before any chunk/document ids are assigned, so reruns over unchanged input yield identical ids. See performance and reproducibility for tuning and guarantees.
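
The re-sorting guarantee can be sketched with the standard library alone. This is an illustration under stated assumptions, not indx's scheduler: the `parse` function is a toy, the files are plain `(lineage, path)` tuples, and the sequential `doc-0000` id scheme is hypothetical, chosen because it makes the dependence on ordering visible.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical inputs: (folder lineage, path) pairs.
files = [
    (("policies", "data"), "policies/data/retention.pdf"),
    (("handbook",), "handbook/intro.md"),
    (("policies",), "policies/overview.md"),
]


def parse(item):
    """Toy parser standing in for Parser.parse."""
    lineage, path = item
    return {"lineage": lineage, "path": path, "text": f"parsed {path}"}


def parse_all(items, jobs):
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = [pool.submit(parse, item) for item in items]
        # Completion order depends on scheduling and may differ between runs.
        results = [future.result() for future in as_completed(futures)]
    # Restore a deterministic order: folder lineage first, then path.
    results.sort(key=lambda doc: (doc["lineage"], doc["path"]))
    # Ids are assigned only after sorting (sequential ids are an assumption).
    return {f"doc-{i:04d}": doc for i, doc in enumerate(results)}


first = parse_all(files, jobs=4)
second = parse_all(list(reversed(files)), jobs=2)
```

Because the sort happens before any id is handed out, `first` and `second` map the same ids to the same documents even though the input order and worker count differ.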

# Parse (and embed) with 8 workers
indx ./docs --out ./ai-ready --jobs 8

Parse uses per-item skip as its default failure mode. When a single file fails to parse, indx does not abort the whole run:

  • The file is skipped.
  • A StageError(kind="skip") is appended to ctx.errors (recording the stage, the offending item path, and a message).
  • The pipeline continues with the remaining files.

Skipped items are surfaced on the resulting space under space.metadata["errors"], and the build summary reports the count:

02 parse 127 ok, 1 skipped

Passing --strict (or strict=True in the SDK) promotes every skip to a fatal error: the first bad file aborts the run with a PipelineError, and the CLI exits with code 1. Use strict mode in CI or when a corrupt input must never silently disappear. Genuinely fatal conditions — an unresolvable parser name or a missing extra — always abort regardless of strict mode. See errors and exit codes for the full table.
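
The difference between per-item skip and strict mode can be sketched as follows. The `StageError` and `PipelineError` classes here are local stand-ins whose fields follow the description above (stage, item path, message); `Ctx`, `toy_parse`, and `run_parse` are hypothetical helpers, and the real classes may differ.

```python
from dataclasses import dataclass, field


@dataclass
class StageError:
    """Stand-in for indx's StageError; field names are assumptions."""
    stage: str
    item: str
    message: str
    kind: str = "skip"


class PipelineError(Exception):
    """Stand-in for the fatal error raised in strict mode."""


@dataclass
class Ctx:
    parsed: dict = field(default_factory=dict)
    errors: list = field(default_factory=list)


def toy_parse(path: str, data: bytes) -> str:
    # Raises UnicodeDecodeError when the bytes are not valid UTF-8.
    return data.decode("utf-8")


def run_parse(files: dict, ctx: Ctx, strict: bool = False) -> Ctx:
    for path, data in files.items():
        try:
            ctx.parsed[path] = toy_parse(path, data)
        except Exception as exc:
            if strict:
                # Strict mode: the first bad file aborts the run.
                raise PipelineError(f"parse failed for {path}: {exc}") from exc
            # Default mode: record the skip and keep going.
            ctx.errors.append(StageError(stage="parse", item=path, message=str(exc)))
    return ctx


files = {"a.txt": b"ok", "b.bin": b"\xff\xfe\x00", "c.txt": b"fine"}
ctx = run_parse(files, Ctx())
summary = f"{len(ctx.parsed)} ok, {len(ctx.errors)} skipped"
```

In this sketch the invalid file lands in `ctx.errors` while the other two are parsed; calling `run_parse(files, Ctx(), strict=True)` raises `PipelineError` on `b.bin` instead.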

| Reads from context | Writes to context |
| --- | --- |
| ctx.dir_graph (from Walk) | ctx.parsed: doc_id → ParsedDoc |
| ctx.config, ctx.parser | ctx.errors: per-item skips |

Next, the Chunk stage consumes each ParsedDoc and splits it into retrievable Chunks while preserving the structure that Parse captured.