04 · Relate

Stage 04 is where indx earns its tagline. After Walk, Parse, and Chunk have produced documents and chunks, Relate resolves the typed edges between them — turning a flat pile of text into a navigable graph. This is the structure that file-level parsers throw away, and it is the heart of indx’s value.

Relate is a built-in stage (it has no swappable component slot). It reads the directory graph, document lineage, and chunk neighbor links already in the SpaceContext and writes Relation edges back into that same context.

What Relate produces

The output of this stage is a set of Relation edges. A Relation is a typed, directed edge that may connect a document to a document, a chunk to a chunk, or a chunk to a document. Edges are attached where they belong — on a chunk’s relations list, on a document’s references / referenced_by lists — and may also be mirrored at the space level in ctx.relations.

{ "type": "references", "to": "legal/gdpr.md" }

{ "type": "duplicate-of", "to": "doc_0041", "weight": 0.94 }

Like every stage, Relate obeys the contract run(ctx: SpaceContext) -> SpaceContext and returns the same mutated context. See the pipeline overview for how stages are sequenced and how to insert, replace, or drop one.

The Relation model

A Relation has four fields. Only type and to are required.

Field	Type	Required	Meaning
`type`	`RelationType`	yes	The kind of edge (see below).
`to`	`string`	yes	Target id or path, e.g. `chunk_0482`, `doc_0007`, or `legal/gdpr.md`.
`from_id`	`string` (optional)	no	Source id or path. Omitted when the edge is stored on its owning object, because the owner is the implicit source.
`weight`	`float` (optional)	no	Confidence or similarity score in the range `[0, 1]`, where applicable.

weight is meaningful for inferred edges — references resolved by similarity, or duplicate-of detected by near-match — and is typically omitted for structural edges like sibling and parent, which are facts about the directory rather than guesses.
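As a minimal sketch, the four fields could be modelled like this. The class below is illustrative, not indx's actual definition: the `to_json` helper and the range check on `weight` are assumptions added for the example.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Relation:
    """Illustrative mirror of the four documented fields (not the real class)."""

    type: str                       # one of the RelationType values
    to: str                         # target id or path
    from_id: Optional[str] = None   # omitted when the owner is the implicit source
    weight: Optional[float] = None  # score in [0, 1], usually only on inferred edges

    def __post_init__(self) -> None:
        # Assumption: reject scores outside the documented [0, 1] range.
        if self.weight is not None and not 0.0 <= self.weight <= 1.0:
            raise ValueError(f"weight must be in [0, 1], got {self.weight}")

    def to_json(self) -> dict:
        # Hypothetical helper: drop unset optional fields so structural
        # edges serialize to just "type" and "to".
        return {k: v for k, v in asdict(self).items() if v is not None}


structural = Relation(type="sibling", to="doc_0008")
inferred = Relation(type="duplicate-of", to="doc_0041", weight=0.94)
```

Serialized this way, a structural edge carries only `type` and `to`, while an inferred edge keeps its score.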

The five relation types

RelationType is a string enum that currently has five members. New members may be added in future versions, so consumers must ignore unknown values rather than fail.

Type	Value	Connects	Derived from
`sibling`	`sibling`	document ↔ document	Same folder / same logical group.
`parent`	`parent`	document → document	Folder lineage / containment.
`references`	`references`	doc/chunk → doc/chunk	Outgoing citation, link, or mention.
`continues`	`continues`	chunk → chunk	Next unit in a split sequence.
`duplicate-of`	`duplicate-of`	doc/chunk → doc/chunk	Near or exact duplicate content.
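One way a consumer might honor the "ignore unknown values" rule is sketched below. The enum values come from the table above; `parse_type` and `known_edges` are hypothetical helper names, not part of indx.

```python
from enum import Enum
from typing import Iterable, List, Optional


class RelationType(str, Enum):
    SIBLING = "sibling"
    PARENT = "parent"
    REFERENCES = "references"
    CONTINUES = "continues"
    DUPLICATE_OF = "duplicate-of"


def parse_type(value: str) -> Optional[RelationType]:
    """Return the enum member, or None for a value this consumer does not know."""
    try:
        return RelationType(value)
    except ValueError:
        return None


def known_edges(edges: Iterable[dict]) -> List[dict]:
    """Keep edges with a recognized type and silently skip the rest."""
    return [e for e in edges if parse_type(e.get("type", "")) is not None]


edges = [
    {"type": "references", "to": "legal/gdpr.md"},
    {"type": "supersedes", "to": "doc_0001"},  # made-up future type
]
usable = known_edges(edges)
```

Skipping rather than raising keeps an older consumer working when a newer index introduces a sixth type.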

sibling

Two documents that live in the same folder — or otherwise belong to the same logical group — are siblings. This is what lets an agent answer “what sits next to this?” When a query matches the onboarding doc, its siblings (the rest of the onboarding folder) are one hop away, instead of being scattered across an undifferentiated vector index.

sibling is purely structural: it falls out of the directory graph built in Walk, so it is high-precision and cheap.
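The same-folder rule can be sketched in a few lines. This is an illustration of the idea, not Relate's implementation: it takes a plain list of paths instead of the directory graph, covers only the "same folder" case (not other logical groups), and `sibling_edges` is a hypothetical name.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, List


def sibling_edges(paths: List[str]) -> Dict[str, List[dict]]:
    """Map each document path to sibling edges for the other files in its folder."""
    by_folder: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        by_folder[str(PurePosixPath(path).parent)].append(path)

    edges: Dict[str, List[dict]] = {}
    for members in by_folder.values():
        for path in members:
            edges[path] = [
                {"type": "sibling", "to": other} for other in members if other != path
            ]
    return edges


edges = sibling_edges(
    ["onboarding/welcome.md", "onboarding/setup.md", "legal/gdpr.md"]
)
```

No weight is attached, since a shared folder is a fact about the tree rather than a guess.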

parent

parent captures folder lineage and containment: the ancestry that runs root → leaf. The document at policies/data/retention.pdf has policies/data and policies in its lineage, and parent edges make that hierarchy traversable as graph edges, not just as a lineage array on the Document. This is how an agent can scope reasoning to “only contracts under /2024/acme.”

Like sibling, parent is derived directly from the walked tree and is therefore reliable.
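To make the lineage and scoping idea concrete, here is a small sketch that works on plain paths. The helper names `lineage` and `is_under` are hypothetical, and the example file names are invented. How indx stores lineage internally is not specified here.

```python
from pathlib import PurePosixPath
from typing import List


def lineage(path: str) -> List[str]:
    """Ancestor folders of a document path, ordered root to leaf."""
    parents = PurePosixPath(path.lstrip("/")).parents
    # PurePosixPath yields ancestors leaf-first and ends with "."; drop it and flip.
    return [str(p) for p in reversed(parents) if str(p) != "."]


def is_under(path: str, folder: str) -> bool:
    """True when `folder` is one of the document's ancestors."""
    return folder.strip("/") in lineage(path)


ancestors = lineage("policies/data/retention.pdf")
in_scope = is_under("2024/acme/msa.pdf", "/2024/acme")
out_of_scope = is_under("2024/acme-old/msa.pdf", "/2024/acme")
```

Comparing whole ancestor paths, rather than string prefixes, avoids matching a look-alike folder such as `2024/acme-old`.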

continues

When a single logical unit is split — a long section spread across chunks, or content that spills from one file into the next — continues records the next unit in the sequence. It complements the chunk-level neighbors links produced in Chunk: neighbors are the immediate previous/next chunk ids; continues is the typed edge that lets retrieval follow a run of content forward as a first-class relation.

This lets an agent expand a retrieved fragment into the full passage instead of answering from an orphaned slice.
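A sketch of that expansion, assuming chunks are available as a dict keyed by id and that each chunk carries the `relations` list described earlier. The function name and the chunk ids are illustrative.

```python
from typing import Dict, List, Optional


def expand_passage(start_id: str, chunks: Dict[str, dict]) -> List[str]:
    """Follow `continues` edges forward from a chunk and return the ids in order."""
    run: List[str] = []
    seen = set()
    current: Optional[str] = start_id
    while current in chunks and current not in seen:  # guard against cycles
        seen.add(current)
        run.append(current)
        next_ids = [
            r["to"]
            for r in chunks[current].get("relations", [])
            if r["type"] == "continues"
        ]
        current = next_ids[0] if next_ids else None
    return run


chunks = {
    "chunk_0482": {"relations": [{"type": "continues", "to": "chunk_0483"}]},
    "chunk_0483": {"relations": [{"type": "continues", "to": "chunk_0484"}]},
    "chunk_0484": {"relations": []},
}
passage = expand_passage("chunk_0482", chunks)
```

The walk stops at the first chunk with no outgoing `continues` edge, or when the target id is not present in the space.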

references

references is an outgoing citation, link, or mention — “this invoice references that PO,” “this guide links to the GDPR policy.” It is the edge that connects documents across folders, and it is the one most likely to be inferred rather than read off the tree. Resolution may draw on explicit links and filename mentions, and (where enabled) embedding similarity, which is why references edges often carry a weight.

References are stored on a document’s outgoing references list, and the reverse edge is recorded on the target’s referenced_by list (see below).
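The explicit-link case might look like the sketch below, which covers only Markdown-style links and resolves them relative to the source document's folder. It is an assumption-laden illustration: the regular expression, the function name, and the rule of keeping only targets that exist in the space are all choices made for the example, not documented indx behavior.

```python
import posixpath
import re
from typing import List, Set

# Matches the target of a Markdown inline link: [label](target)
LINK_TARGET = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def explicit_references(doc_path: str, text: str, known_paths: Set[str]) -> List[dict]:
    """Resolve relative Markdown links to paths that exist in the space."""
    base = posixpath.dirname(doc_path)
    found: List[str] = []
    for href in LINK_TARGET.findall(text):
        if "://" in href:  # external URL, not a document in the space
            continue
        target = posixpath.normpath(posixpath.join(base, href.split("#")[0]))
        if target in known_paths and target != doc_path and target not in found:
            found.append(target)
    return [{"type": "references", "to": t} for t in found]


text = "See the [GDPR policy](../legal/gdpr.md#scope) and [our site](https://example.com)."
refs = explicit_references(
    "guides/onboarding.md", text, {"guides/onboarding.md", "legal/gdpr.md"}
)
```

Edges found this way are read directly from the text, so the sketch attaches no `weight`. Similarity-based resolution would be the case where a score applies.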

duplicate-of

duplicate-of flags near or exact duplicate content — two versions of the same contract, a file copied between folders, an export that overlaps an original. It is almost always an inferred, scored edge: expect a weight close to 1.0 for high-confidence duplicates.

This is what lets a research archive dedupe versions, and lets retrieval avoid returning five copies of the same passage.
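On the retrieval side, a consumer could use these edges to collapse duplicates in a ranked result list. The sketch below assumes relations are available as a dict from id to edge list. The 0.9 threshold, the treatment of a missing weight as 1.0, and the function names are all assumptions for illustration.

```python
from typing import Dict, List, Set


def strong_duplicates(item_id: str, relations: Dict[str, List[dict]], min_weight: float) -> Set[str]:
    """Targets that `item_id` duplicates with at least `min_weight` confidence."""
    return {
        r["to"]
        for r in relations.get(item_id, [])
        # Assumption: an edge without a weight is treated as an exact duplicate.
        if r["type"] == "duplicate-of" and r.get("weight", 1.0) >= min_weight
    }


def dedupe_hits(hits: List[str], relations: Dict[str, List[dict]], min_weight: float = 0.9) -> List[str]:
    """Keep the highest-ranked item from each group of strong duplicates."""
    kept: List[str] = []
    for hit in hits:
        dups = strong_duplicates(hit, relations, min_weight)
        if any(k in dups or hit in strong_duplicates(k, relations, min_weight) for k in kept):
            continue  # a duplicate of something already kept, in either direction
        kept.append(hit)
    return kept


relations = {
    "doc_0052": [{"type": "duplicate-of", "to": "doc_0041", "weight": 0.94}],
    "doc_0060": [{"type": "duplicate-of", "to": "doc_0041", "weight": 0.55}],
}
results = dedupe_hits(["doc_0041", "doc_0052", "doc_0060", "doc_0007"], relations)
```

Here `doc_0052` is dropped as a strong duplicate of `doc_0041`, while `doc_0060` survives because its score falls below the threshold.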

Priority: what ships when

The relation types are not all equal in maturity. indx ships the structural edges first and treats inference as a quality-sensitive follow-on.

Tier	Relation types	Status
P0 (MVP)	`sibling`, `parent`, `continues`	Structural — derived from the directory and chunk graph.
P1 (v1.0)	`references`, `duplicate-of`	Inferred — link/mention extraction and similarity.

The split is deliberate. sibling, parent, and continues are facts about the directory and the chunking, so they are dependable from day one. references and duplicate-of require inference, and low-quality inferred edges would undermine the navigable graph that is indx’s core value, so they follow in P1.

Reverse edges on documents

Documents carry references in both directions:

references — outgoing edges resolved in this stage (what this document points to).
referenced_by — incoming edges (what points to this document), recorded as reverse Relations that include from_id so the source is explicit.

{
  "id": "doc_0007",
  "path": "policies/data/retention.pdf",
  "references":    [ { "type": "references", "to": "legal/gdpr.md" } ],
  "referenced_by": [
    { "type": "references", "to": "policies/data/retention.pdf", "from_id": "guides/onboarding.md" }
  ]
}

Maintaining both directions means an agent can navigate citations either way — “what does this cite?” and “who cites this?” — without re-scanning the whole space.
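Building the reverse direction from the forward one can be sketched as a single pass. The example assumes documents are held in a dict keyed by path and mirrors the JSON shape shown above, where the reverse edge keeps `to` as the target and names the source in `from_id`. The function name is hypothetical.

```python
from typing import Dict


def add_reverse_edges(docs: Dict[str, dict]) -> None:
    """Rebuild every `referenced_by` list from the outgoing `references` lists."""
    for doc in docs.values():
        doc["referenced_by"] = []  # start clean so the pass can be re-run safely

    for source_path, doc in docs.items():
        for edge in doc.get("references", []):
            target = docs.get(edge["to"])
            if target is None:
                continue  # target is not a document in this space
            target["referenced_by"].append(
                {"type": "references", "to": edge["to"], "from_id": source_path}
            )


docs = {
    "guides/onboarding.md": {
        "references": [{"type": "references", "to": "policies/data/retention.pdf"}]
    },
    "policies/data/retention.pdf": {
        "references": [{"type": "references", "to": "legal/gdpr.md"}]
    },
}
add_reverse_edges(docs)
```

After the pass, answering "who cites this?" is a lookup on the target document instead of a scan over every `references` list.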

How it fits the pipeline

Relate runs after Chunk and before Enrich:

01 Walk → 02 Parse → 03 Chunk → 04 Relate → 05 Enrich → 06 Embed+Pack

It depends on the directory graph from Walk (for sibling / parent) and on chunk lineage and neighbor links from Chunk (for continues).
It runs before Enrich, so the LLM in stage 05 can take the resolved graph into account when producing type, topics, and summaries.
It is CPU-bound and largely single-pass; it does not call any model by itself.

Because Relate is a built-in stage with no component slot, you tune it through configuration rather than by swapping a backend. If you need bespoke linking logic, you can add your own pass with a custom stage — for example, inserting a domain-specific reference resolver after Relate.
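A domain-specific resolver of that kind might look like the sketch below, which links any document that mentions a purchase-order number to the PO document itself. Only the `run(ctx) -> ctx` contract comes from this page. The `SpaceContext` stand-in, its `documents` field, the `text` key, the `PO-1234` naming pattern, and the class name are all assumptions, and registering the stage is covered in the pipeline overview rather than here.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SpaceContext:
    """Stand-in for the real context; the field names are assumed."""

    documents: List[dict] = field(default_factory=list)
    relations: List[dict] = field(default_factory=list)


class PurchaseOrderResolver:
    """Hypothetical custom stage: adds `references` edges from PO mentions."""

    PO_PATTERN = re.compile(r"\bPO-\d+\b")

    def run(self, ctx: SpaceContext) -> SpaceContext:
        # Index PO documents by the number that appears in their path.
        po_paths: Dict[str, str] = {}
        for doc in ctx.documents:
            match = self.PO_PATTERN.search(doc["path"])
            if match:
                po_paths[match.group(0)] = doc["path"]

        for doc in ctx.documents:
            refs = doc.setdefault("references", [])
            for match in self.PO_PATTERN.finditer(doc.get("text", "")):
                target = po_paths.get(match.group(0))
                already = any(r["to"] == target for r in refs)
                if target and target != doc["path"] and not already:
                    refs.append({"type": "references", "to": target})
        return ctx  # same mutated context, per the stage contract


ctx = SpaceContext(documents=[
    {"path": "orders/PO-1042.pdf", "text": "Purchase order PO-1042."},
    {"path": "invoices/inv-77.pdf", "text": "Payment for PO-1042 is due. Ref PO-1042."},
])
ctx = PurchaseOrderResolver().run(ctx)
```

Because the stage only appends to existing `references` lists, running it after Relate leaves the built-in edges untouched.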