01 · Walk

Walk is the first of the six pipeline stages. It turns a directory (or a .zip) into a navigable structure: it enumerates files, builds a directory graph, captures folder lineage and basic file metadata, and detects each file’s type — laying the groundwork every later stage depends on.

Walk is a built-in stage — it has no swappable component slot of its own. Where Parse delegates to the Parser protocol and Embed+Pack delegates to the Embedder, Store, and OutputWriter protocols, Walk is pure orchestration that ships in the dependency-light core. It runs with zero extras installed.

Its job, per FR-S01 (priority P0 — MVP / must-have), is to:

  • Traverse the input directory or ZIP, discovering every candidate file.
  • Build the directory graph (dir_graph) — a folder → children mapping that records the shape of the estate.
  • Capture folder lineage (root → leaf ancestry) and basic file metadata for each file.
  • Detect file types so downstream stages know what they are dealing with.
  • Honour include/exclude filters so ignored paths never enter the pipeline.
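Two of the responsibilities above, lineage capture and type detection, can be illustrated with the standard library alone. This is a minimal sketch, not indx's actual code: the helper names `folder_lineage` and `detect_type` are hypothetical, and guessing the type from the file extension is an assumption (the real stage may inspect content instead).

```python
import mimetypes
from pathlib import PurePosixPath


def folder_lineage(rel_path: str) -> list[str]:
    """Return the root-to-leaf folder ancestry of a path relative to the walk root."""
    return list(PurePosixPath(rel_path).parent.parts)


def detect_type(rel_path: str) -> str:
    """Guess a MIME type from the extension (assumption: extension-based detection)."""
    guessed, _encoding = mimetypes.guess_type(rel_path)
    # Unknown extensions fall back to a generic binary type.
    return guessed or "application/octet-stream"
```

A file at the walk root has an empty lineage, which keeps root-level files distinguishable from nested ones.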

This is the stage that realizes indx’s core thesis: most knowledge does not live in a single file, it lives in the arrangement of files. Walk preserves that arrangement instead of flattening it away.

Every stage obeys the same contract, run(ctx: SpaceContext) -> SpaceContext: it receives the shared, mutable SpaceContext, mutates it in place, and returns that same object. Walk is the first writer into that context.
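The contract can be sketched as follows. Only the `run` signature and the two field names come from this page; the `SpaceContext` shape shown here is a hypothetical stand-in (the real model is described in the data models reference), and `WalkStage` is an illustrative name.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SpaceContext:
    # Hypothetical minimal shape: only the two fields this page mentions.
    dir_graph: dict[str, Any] = field(default_factory=dict)
    documents: list[Any] = field(default_factory=list)


class WalkStage:
    def run(self, ctx: SpaceContext) -> SpaceContext:
        # Mutate the shared context in place (placeholder write for illustration).
        ctx.dir_graph["."] = []
        # Return the very same object, never a copy.
        return ctx


ctx = SpaceContext()
returned = WalkStage().run(ctx)
```

Because every stage returns the object it was given, a pipeline runner can simply thread one context through all stages.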

| Field | Walk’s contribution |
| --- | --- |
| ctx.dir_graph | dict[str, Any]: folder → children plus detected types. This is Walk’s primary output. |
| ctx.documents | The beginnings of the Document graph: folder, lineage, and path are populated here. Enrichment fields (type, topics, tags, summary) are filled in later stages. |

dir_graph carries the folder structure forward so that Relate can derive sibling and parent edges, and so that every Document can record its folder lineage. The Document objects Walk seeds are progressively enriched across stages 01–05.
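To make the folder → children mapping concrete, here is a minimal sketch of how such a graph could be built from relative paths and then queried for siblings. The function names and the exact layout (string keys, "." for the root, no detected types) are assumptions for illustration; the real dir_graph also carries detected types.

```python
from pathlib import PurePosixPath


def build_dir_graph(rel_paths: list[str]) -> dict[str, list[str]]:
    """Map each folder to its direct children (files and subfolders)."""
    graph: dict[str, list[str]] = {".": []}
    for rel in sorted(rel_paths):
        parts = PurePosixPath(rel).parts
        parent = "."
        # Link each path component to its parent folder, root first.
        for i in range(len(parts)):
            child = str(PurePosixPath(*parts[: i + 1]))
            if child not in graph[parent]:
                graph[parent].append(child)
            if i < len(parts) - 1:  # every component but the last is a folder
                graph.setdefault(child, [])
            parent = child
    return graph


def siblings(graph: dict[str, list[str]], rel_path: str) -> list[str]:
    """Return the other children of rel_path's parent folder."""
    parent = str(PurePosixPath(rel_path).parent)
    return [c for c in graph.get(parent, []) if c != rel_path]


graph = build_dir_graph(["a/x.md", "a/y.md", "b/z.md"])
```

A later stage such as Relate could derive sibling edges from a lookup like `siblings(graph, "a/x.md")` without touching the filesystem again.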

Walk does not parse anything itself. Instead it produces, for each discovered file, the resolved description that the Parse stage consumes. The Parser protocol is defined as:

from typing import Protocol, runtime_checkable

@runtime_checkable
class Parser(Protocol):
    """Converts a single file into a ParsedDoc. Default: docling."""

    def parse(self, file: "FileRef") -> ParsedDoc: ...

A FileRef carries everything Walk resolved:

  • the resolved path of the file,
  • a bytes accessor so the parser reads content without re-walking,
  • the detected MIME / type,
  • the folder lineage captured during traversal.

This keeps the responsibilities clean: Walk decides what exists and where it sits; the parser decides what it says. Because FileRef is a plain core type, a custom parser or custom component receives exactly the same handoff a built-in parser does.
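The handoff described above might look like the following. This is a sketch under assumptions: the field names (`path`, `read_bytes`, `mime`, `lineage`) are hypothetical stand-ins for the four items listed, and the toy parser returns a plain string where a real parser would return a ParsedDoc.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class FileRef:
    # Hypothetical field names mirroring the four items listed above.
    path: Path                       # resolved path of the file
    read_bytes: Callable[[], bytes]  # lazy accessor, so content is read on demand
    mime: str                        # detected MIME / type
    lineage: tuple[str, ...]         # root-to-leaf folder ancestry


class UpperCaseParser:
    """Toy custom parser; returns a str as a stand-in for ParsedDoc."""

    def parse(self, file: FileRef) -> str:
        return file.read_bytes().decode("utf-8").upper()


ref = FileRef(Path("notes/a.txt"), lambda: b"hello", "text/plain", ("notes",))
parsed = UpperCaseParser().parse(ref)
```

The parser never touches the filesystem layout itself; everything it needs arrives on the FileRef.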

Walk honours include/exclude filters so that ignored paths (build artifacts, lockfiles, vendored directories) never reach Parse. Filtering at the Walk boundary keeps the working set small, which matters for the large-directory performance goals: indx streams files as an iterator rather than materializing a 10k-file estate into memory, so a 2 GB folder does not require 2 GB of RAM.
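Filtering at the boundary while streaming can be sketched with a generator over `os.walk`. The function name `iter_files`, the glob-style pattern semantics, and the rule that a pattern may match either the full relative path or the final component are all assumptions made for this example, not indx's documented filter syntax.

```python
import fnmatch
import os
from typing import Iterator, Sequence


def _matches(rel: str, patterns: Sequence[str]) -> bool:
    """True if the relative path or its final component matches any glob."""
    name = rel.rsplit("/", 1)[-1]
    return any(
        fnmatch.fnmatchcase(rel, pat) or fnmatch.fnmatchcase(name, pat)
        for pat in patterns
    )


def iter_files(
    root: str,
    include: Sequence[str] = ("*",),
    exclude: Sequence[str] = (),
) -> Iterator[str]:
    """Lazily yield '/'-separated paths relative to root that pass the filters."""
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root).replace(os.sep, "/")
        prefix = "" if rel_dir == "." else rel_dir + "/"
        # Prune excluded folders in place so os.walk never descends into them.
        dirnames[:] = sorted(d for d in dirnames if not _matches(prefix + d, exclude))
        for name in sorted(filenames):
            rel = prefix + name
            if _matches(rel, include) and not _matches(rel, exclude):
                yield rel
```

Because the function is a generator, only one directory listing is held at a time, and pruning `dirnames` means an excluded folder such as a vendored directory is never even listed.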

Security: treat scanned directories as untrusted

A scanned directory — and especially a ZIP — is untrusted input. Walk is the boundary where that input first enters indx, so it is where the guards live:

These are non-negotiable contributor rules, not optional hardening. Any code that walks a directory or expands a ZIP must apply them.
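The specific guards are not enumerated in this section, so the following is only an illustration of the kind of check commonly applied to untrusted archives, not a statement of indx's rules. It assumes two typical guards, rejecting entries that escape the archive root and capping the declared expanded size; the function name `safe_members` and the default limit are hypothetical.

```python
import io
import zipfile
from pathlib import PurePosixPath


def safe_members(
    archive: zipfile.ZipFile, max_total_bytes: int = 1 << 30
) -> list[zipfile.ZipInfo]:
    """Return file entries that stay inside the archive root, or raise ValueError."""
    total = 0
    safe: list[zipfile.ZipInfo] = []
    for info in archive.infolist():
        if info.is_dir():
            continue
        name = PurePosixPath(info.filename.replace("\\", "/"))
        # Reject absolute paths and any '..' component (path traversal).
        if name.is_absolute() or ".." in name.parts:
            raise ValueError(f"unsafe path in archive: {info.filename!r}")
        # Sum the declared uncompressed sizes to catch oversized archives early.
        total += info.file_size
        if total > max_total_bytes:
            raise ValueError("archive expands beyond the configured size limit")
        safe.append(info)
    return safe
```

Validating every entry before reading any content means a single hostile entry rejects the whole archive rather than being silently skipped.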

Walk traverses in a deterministic order — folder lineage, then path. This ordering is what later makes chunk and document ids stable: ids are assigned by traversal order, so re-running over unchanged input yields identical ids and a byte-stable index.json. Parallel per-folder traversal, where used, never affects the final ordering — results are re-sorted into the canonical order before any id is assigned. See reproducibility for the full guarantee.
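The ordering rule can be sketched as a sort on (lineage, path) followed by id assignment in that order. The tuple representation of a discovered file and the `doc-0001` id format are assumptions for illustration; the source only states that ids follow traversal order.

```python
from typing import Iterable

# A discovered file, represented here as (folder lineage, relative path).
Discovered = tuple[tuple[str, ...], str]


def canonical_order(files: Iterable[Discovered]) -> list[Discovered]:
    """Sort by folder lineage first, then by path."""
    return sorted(files, key=lambda f: (f[0], f[1]))


def assign_ids(files: Iterable[Discovered]) -> dict[str, str]:
    """Assign ids by canonical position (hypothetical id format)."""
    ordered = canonical_order(files)
    return {path: f"doc-{i:04d}" for i, (_lineage, path) in enumerate(ordered, start=1)}


found = [
    (("b",), "b/z.md"),
    (("a",), "a/y.md"),
    (("a",), "a/x.md"),
    ((), "readme.md"),
]
ids = assign_ids(found)
```

Since ids depend only on the sorted order, the order in which parallel workers report their results has no effect on the outcome.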

| # | Stage | Component |
| --- | --- | --- |
| 01 | Walk | none (built-in) |
| 02 | Parse | Parser |
| 03 | Chunk | none (built-in) |
| 04 | Relate | none (built-in) |
| 05 | Enrich | LLM, VLM |
| 06 | Embed+Pack | Embedder, Store, OutputWriter |

Walk feeds its dir_graph, seeded Documents, and per-file FileRefs straight into 02 Parse, which runs each file through the configured parser. For the full stage contract and the SpaceContext shape, see the pipeline overview and the data models reference.