Tutorial: Your First Knowledge Space

This tutorial takes you all the way from a folder of documents to a portable, queryable knowledge space — building it, reading every line of the per-stage summary, exploring it from the CLI, then doing the same journey in Python and proving the result is portable. By the end you will understand what indx produces and how to use it.

If you have not installed indx yet, start with Installation. For the absolute fastest path, the Quickstart is a two-command tour; this page is the slower, explained version.

1. Set up a sample directory

indx works on a directory, not a single file — the arrangement of files is exactly the signal it keeps. Create a small tree that mixes document types and folders so you can see structure, types, and relations show up later.

handbook/
- policies/
  - data/
    - retention.pdf
    - access-control.md
  - security.md
- guides/
  - onboarding.md
  - deploying.md
- contracts/
  - 2024/
    - acme-msa.pdf
    - acme-sow.docx

The nesting matters: files in the same folder become sibling relations, folder lineage becomes parent relations, and cross-file mentions (for example, onboarding.md linking to retention.pdf) become references. Folders like contracts/2024/ carry meaning that survives into every Document and Chunk.
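To make the folder-to-relation mapping concrete, here is a minimal sketch that derives parent and sibling edges from a list of relative paths. It is an illustration under assumptions, not indx's Relate stage: the helper name derive_relations and the (type, source, target) tuple shape are hypothetical.

```python
from itertools import combinations
from pathlib import PurePosixPath

# Hypothetical helper: indx's real Relate stage is not shown in this tutorial.
def derive_relations(paths):
    """Return (type, source, target) edges implied by folder layout alone."""
    relations = []
    by_folder = {}
    for p in sorted(paths):
        folder = str(PurePosixPath(p).parent)
        relations.append(("parent", p, folder))   # folder lineage
        by_folder.setdefault(folder, []).append(p)
    for members in by_folder.values():
        for a, b in combinations(members, 2):     # same-folder grouping
            relations.append(("sibling", a, b))
    return relations

files = [
    "policies/data/retention.pdf",
    "policies/data/access-control.md",
    "policies/security.md",
    "contracts/2024/acme-msa.pdf",
    "contracts/2024/acme-sow.docx",
]
edges = derive_relations(files)
for edge in edges:
    print(edge)
```

References edges are deliberately left out of the sketch, since resolving cross-file mentions requires reading file contents rather than just paths.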

2. Build the knowledge space

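Run the build. The positional argument is the source directory; --out is required and receives the output layout. The three invocations below are alternatives: the first names the local LLM and embedder explicitly, the second skips the LLM (--llm none), and the third also writes to a JSONL store (--store jsonl).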

# Requires the local bundle: pip install "indx[local]" (bundles bge-m3).
indx ./handbook --out ./ai-ready --llm ollama:qwen2.5 --embedder bge-m3

indx ./handbook --out ./ai-ready --llm none

indx ./handbook --out ./ai-ready --llm none --store jsonl

indx streams one progress line per stage, then a final summary. Here is a representative run:

indx ./handbook → ./ai-ready
  01 walk    9 files, 6 folders
  02 parse   9 ok, 0 skipped
  03 chunk   84 chunks
  04 relate  31 relations
  05 enrich  9 documents (ollama:qwen2.5)
  06 embed   84 vectors → qdrant, sealed handbook.indx
done: 84 chunks, 9 docs, embed_dim=1024  (8.7s)

Reading each line

| Line | What it means |
|------|---------------|
| 01 walk 9 files, 6 folders | Stage Walk traversed the tree, built the directory graph, and detected file types. Counts are your raw inventory. |
| 02 parse 9 ok, 0 skipped | Stage Parse ran each file through the parser (docling) into a ParsedDoc. skipped counts per-item failures recorded as non-fatal StageErrors; 0 means every file parsed. |
| 03 chunk 84 chunks | Stage Chunk split parsed content into retrievable Chunks, keeping lineage and continues neighbor links. |
| 04 relate 31 relations | Stage Relate resolved typed edges: sibling, parent, references, continues, duplicate-of. |
| 05 enrich 9 documents (ollama:qwen2.5) | Stage Enrich used the LLM to add detected type, topics, tags, and summaries per Document. The model in use is named so the run is auditable. |
| 06 embed 84 vectors → qdrant, sealed handbook.indx | Stage Embed+Pack vectorized chunks with bge-m3, wrote vectors to the store, and sealed the archive. |
| done: … embed_dim=1024 (8.7s) | Final totals plus the embedding dimensionality (1024 for bge-m3) and wall-clock time. |
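The "ok / skipped" split in the parse line reflects a record-and-continue pattern: one bad file is noted and the run moves on. The sketch below shows that pattern in isolation. The StageError fields, the parse_all helper, and the toy parser are all assumptions made for illustration; the tutorial does not document the real StageError shape.

```python
from dataclasses import dataclass

@dataclass
class StageError:  # hypothetical shape; the real fields are not documented here
    stage: str
    item: str
    message: str

def parse_all(paths, parse_one):
    """Parse every path, recording failures instead of aborting the run."""
    parsed, errors = {}, []
    for path in paths:
        try:
            parsed[path] = parse_one(path)
        except Exception as exc:  # per-item failure: record it and continue
            errors.append(StageError("parse", path, str(exc)))
    return parsed, errors

def toy_parser(path):
    # Stand-in parser for the sketch: rejects anything that is not Markdown.
    if not path.endswith(".md"):
        raise ValueError("unsupported file type")
    return f"parsed:{path}"

parsed, errors = parse_all(["a.md", "b.bin", "c.md"], toy_parser)
summary = f"02 parse   {len(parsed)} ok, {len(errors)} skipped"
print(summary)
```

Keeping the errors as data rather than raising means a later step can still report exactly which items were skipped and why.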

3. Look at the output layout

--out writes the expanded layout next to the sealed archive, so downstream tools can read either form:

ai-ready/
- handbook.indx
- index.json
- chunks/
  - chunk_0000.json
  - chunk_0001.json
  - …
- embeddings/
  - manifest.json
  - vectors.f32

- handbook.indx is the single portable artifact: a ZIP container holding manifest.json, index.json, the per-chunk files, and the embeddings. This is the file you ship, diff, or hand to a teammate. See the .indx archive reference.
- index.json is the serialized knowledge graph: the space version/root, metadata (tool version, config snapshot, embedder), stats, and the full documents, chunks, and relations. Embeddings are not inlined here. See the index.json reference.
- chunks/ holds one JSON file per chunk in the canonical chunk shape, optionally with a resolved context window for agent consumption.
- embeddings/ holds vectors.f32 (a contiguous little-endian float32 matrix, count × dim) plus a manifest.json pinning the embedder name and dim so a loaded archive can validate query-time compatibility.
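Because vectors.f32 is described as a raw little-endian float32 matrix, it can in principle be read with nothing but the standard library. The sketch below is a hedged illustration: the manifest key names "count" and "dim" are assumptions (the tutorial only says the manifest pins the embedder name and dim), and the demo writes its own tiny file instead of touching a real build.

```python
import json
import struct
import tempfile
from pathlib import Path

def read_vectors(emb_dir):
    """Load a count x dim float32 matrix, checking it against the manifest."""
    emb_dir = Path(emb_dir)
    manifest = json.loads((emb_dir / "manifest.json").read_text())
    count, dim = manifest["count"], manifest["dim"]  # assumed key names
    raw = (emb_dir / "vectors.f32").read_bytes()
    if len(raw) != count * dim * 4:                  # 4 bytes per float32
        raise ValueError("vectors.f32 size does not match manifest")
    flat = struct.unpack(f"<{count * dim}f", raw)    # "<" means little-endian
    return [list(flat[i * dim:(i + 1) * dim]) for i in range(count)]

# Round-trip demo on a tiny 2 x 3 matrix written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "manifest.json").write_text(
        json.dumps({"embedder": "bge-m3", "dim": 3, "count": 2})
    )
    (tmp / "vectors.f32").write_bytes(struct.pack("<6f", 1, 2, 3, 4, 5, 6))
    vectors = read_vectors(tmp)

print(vectors)
```

The same size check gives a quick sanity test for the run above: if the file holds only the raw matrix, 84 vectors of dimension 1024 occupy 84 × 1024 × 4 = 344,064 bytes.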

The expanded folder and the .indx archive describe the same knowledge space — the archive is just the sealed, movable form.
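Since the archive is described as a ZIP container, Python's zipfile module should be able to list its members without any indx code. This sketch builds a stand-in archive in memory so it runs anywhere; the member paths are assumptions based on the layout described above, not a dump of a real handbook.indx.

```python
import io
import zipfile

def list_members(archive):
    """Return the sorted member names of a ZIP-based archive."""
    with zipfile.ZipFile(archive) as zf:
        return sorted(zf.namelist())

# Build a stand-in archive in memory; the member paths are assumed.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for name in ("manifest.json", "index.json", "chunks/chunk_0000.json",
                 "embeddings/manifest.json", "embeddings/vectors.f32"):
        zf.writestr(name, b"{}")
buf.seek(0)

members = list_members(buf)
print(members)
```

Passing the path ./ai-ready/handbook.indx to list_members instead of the in-memory buffer should work the same way, provided the archive is a standard ZIP.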

4. Inspect the archive

indx inspect summarizes a sealed archive: space stats, a document-type histogram, and a sample of relations.

indx inspect ./ai-ready/handbook.indx

handbook.indx  (indx 0.4.2)
  root:       /abs/path/handbook
  documents:  9      chunks: 84      relations: 31
  embeddings: 84     embed_dim: 1024 (bge-m3)

  types
    policy    3
    guide     2
    contract  4

  relations (sample)
    parent       contracts/2024/acme-msa.pdf  →  contracts/2024
    sibling      policies/data/retention.pdf  →  policies/data/access-control.md
    references   guides/onboarding.md         →  policies/data/retention.pdf
    continues    chunk_0007                   →  chunk_0008

How to read it:

The type histogram is space.stats.types — how many documents Enrich classified into each detected type. This is what makes --type/type= filtering possible.
The relations sample shows the typed graph that file-level tools throw away. parent is folder lineage, sibling is same-folder grouping, references is a resolved cross-document mention, and continues links adjacent chunks in a split sequence.
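Both views that inspect prints are simple aggregations over typed records. The sketch below reproduces them on stand-in data: a per-type document count and a per-type relation count, plus a small lookup over one edge type. The dict keys ("type", "source", "target") are assumptions chosen for the example, not the index.json schema.

```python
from collections import Counter

# Stand-in records shaped like the inspect output; the keys are assumed.
documents = [
    {"path": "policies/data/retention.pdf", "type": "policy"},
    {"path": "policies/security.md", "type": "policy"},
    {"path": "guides/onboarding.md", "type": "guide"},
]
relations = [
    {"type": "parent", "source": "contracts/2024/acme-msa.pdf",
     "target": "contracts/2024"},
    {"type": "references", "source": "guides/onboarding.md",
     "target": "policies/data/retention.pdf"},
    {"type": "parent", "source": "guides/onboarding.md", "target": "guides"},
]

type_histogram = Counter(doc["type"] for doc in documents)
relation_histogram = Counter(rel["type"] for rel in relations)

def outgoing(path, kind):
    """Targets reachable from `path` over edges of one relation type."""
    return [r["target"] for r in relations
            if r["source"] == path and r["type"] == kind]

print(dict(type_histogram))
print(dict(relation_histogram))
print(outgoing("guides/onboarding.md", "references"))
```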

Useful flags: --json emits the full space.stats object, and --documents [type] lists documents (optionally filtered by type). See the CLI reference.

5. Query the archive

indx query embeds your text and returns the top-k matching chunks, each with its score, source, and neighbors.

indx query ./ai-ready/handbook.indx "how long is data retained?" -k 3

#1  score 0.83  policies/data/retention.pdf  (policies/data · policy)
    "Enterprise data is retained for 90 days, after which it is purged…"
    neighbors: chunk_0019, chunk_0021

#2  score 0.71  policies/data/access-control.md  (policies/data · policy)
    "Access to retained datasets is restricted to the data-governance role…"
    neighbors: chunk_0040, chunk_0042
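Conceptually, a top-k query ranks every chunk vector by its similarity to the embedded query text. The tutorial does not say which similarity measure indx uses, so the sketch below assumes cosine similarity and uses toy 2-d vectors in place of real 1024-d embeddings. The chunk ids are placeholders.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k):
    """Return the k (chunk_id, score) pairs with the highest similarity."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 2-d vectors standing in for real embeddings.
chunk_vecs = {
    "chunk_0020": [1.0, 0.0],
    "chunk_0041": [1.0, 1.0],
    "chunk_0003": [0.0, 1.0],
}
ranked = top_k([1.0, 0.0], chunk_vecs, k=2)
print(ranked)
```

A brute-force scan like this is fine for a few thousand chunks; a vector store exists to avoid scoring every vector on larger spaces.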

Interpreting a SearchHit:

- score: similarity; higher is better. Use it to judge confidence and where to cut off.
- source path / folder: provenance from hit.source (path, folder, and detected type). This is what lets an agent cite and filter by location.
- chunk text: hit.chunk.text, the retrievable payload that matched.
- neighbor ids: the adjacent chunk ids. Resolve them to expand a context window around the hit instead of feeding the agent an orphaned fragment.
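The two habits described above, cutting off by score and expanding a hit with its neighbors, can be sketched with plain data structures. Everything here is a stand-in: Hit is a simplified substitute for a SearchHit rather than the SDK class, the 0.6 cutoff is arbitrary, and the neighbor chunk texts are placeholder sentences invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Hit:  # simplified stand-in for a SearchHit, not the SDK class
    score: float
    chunk_id: str
    neighbor_ids: list = field(default_factory=list)

def context_window(hit, chunk_texts):
    """Join the hit with its neighbors, in chunk-id order."""
    # Zero-padded ids sort correctly as plain strings.
    ids = sorted({hit.chunk_id, *hit.neighbor_ids})
    return "\n".join(chunk_texts[i] for i in ids if i in chunk_texts)

# Placeholder texts; only the middle sentence echoes the sample output.
chunk_texts = {
    "chunk_0019": "Scope: this policy covers enterprise data.",
    "chunk_0020": "Enterprise data is retained for 90 days.",
    "chunk_0021": "Purge jobs run after the retention window.",
}
hits = [
    Hit(0.83, "chunk_0020", ["chunk_0019", "chunk_0021"]),
    Hit(0.41, "chunk_0055"),
]

MIN_SCORE = 0.6  # arbitrary cutoff for the sketch; tune it per corpus
confident = [h for h in hits if h.score >= MIN_SCORE]
window = context_window(confident[0], chunk_texts)
print(window)
```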

Add --type policy to restrict results to one document type, or --json to emit the full SearchHit[] (including .chunk, .neighbors, and .source) for scripting.

6. The same journey in Python

The SDK is the CLI with handles — the KnowledgeSpace is a first-class object you can explore before you ship.

Build with DirectoryPipeline. Construct it with the slots you want to pin (or with no arguments, for defaults), then call run(src, out).

from indx import DirectoryPipeline

pipeline = DirectoryPipeline(
    parser="docling",
    llm="ollama:qwen2.5",
    embedder="bge-m3",
    store="qdrant",
)
space = pipeline.run("./handbook", "./ai-ready")

print(space.stats.documents, space.stats.chunks, space.stats.embed_dim)

run executes all six stages over ./handbook, writes the output layout to ./ai-ready, and returns the KnowledgeSpace.

Iterate documents by type. space.documents(type=...) filters by detected type; each Document carries its path, topics, and summary.

for doc in space.documents(type="policy"):
    print(doc.path)
    print("  topics: ", doc.topics)
    print("  summary:", doc.summary)

policies/data/retention.pdf
  topics:  ['retention', 'compliance']
  summary: Defines the 90-day retention rule and purge process.
policies/data/access-control.md
  topics:  ['access', 'governance']
  summary: Restricts retained-data access to the data-governance role.

Search and unpack a SearchHit. Same retrieval as the CLI, returned as objects.
```
for hit in space.search("how long is data retained?", k=3):
    print(hit.score, hit.source.path)
    print(hit.chunk.text)
    print("context:", [c.id for c in hit.neighbors])
```
hit.source is a convenience property for hit.chunk.source. The neighbors are resolved Chunk objects, so [c.id for c in hit.neighbors] gives you the adjacent chunk ids — the building blocks of a context window.

No Ollama running? Drop enrichment entirely and keep everything else local:

space = (
    DirectoryPipeline(embedder="bge-m3", store="jsonl")
    .drop("enrich")
    .run("./handbook", "./ai-ready")
)

Documents will have no topics/tags/summary, but the structure, chunks, relations, and vectors are all there. See Bring your own stack.
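One way to picture why dropping a stage leaves the rest intact is an ordered list of named stages, each passing its state to the next. The toy class below is a sketch of that idea only. MiniPipeline and its stage functions are hypothetical and say nothing about how DirectoryPipeline is actually implemented.

```python
class MiniPipeline:
    """Toy ordered-stage pipeline; not the DirectoryPipeline implementation."""

    def __init__(self, stages):
        self.stages = list(stages)  # list of (name, function) pairs

    def drop(self, name):
        """Return a new pipeline without the named stage."""
        if name not in [n for n, _ in self.stages]:
            raise KeyError(f"unknown stage: {name}")
        return MiniPipeline([(n, f) for n, f in self.stages if n != name])

    def run(self, state):
        for _, stage in self.stages:
            state = stage(state)
        return state

def chunk(state):
    return {**state, "chunks": ["c0", "c1"]}

def enrich(state):
    return {**state, "summary": "written by an LLM"}

def embed(state):
    return {**state, "vectors": len(state["chunks"])}

full = MiniPipeline([("chunk", chunk), ("enrich", enrich), ("embed", embed)])
result = full.drop("enrich").run({})
print(result)
```

In this sketch embed depends only on the chunks, so removing enrich changes which fields exist but not whether vectors get produced.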

7. Prove it is portable

The whole point of the .indx archive is that the knowledge space travels. Load the sealed archive on any machine, with no re-processing, and search it with the same API you used after the build.

from indx import KnowledgeSpace

space = KnowledgeSpace.load("./ai-ready/handbook.indx")

print(space.stats.documents, "documents loaded")
for hit in space.search("gdpr compliance", k=5):
    print(hit.score, hit.source.path)

KnowledgeSpace.load verifies the archive’s version, validates checksums, and reconstructs the in-memory models (vectors are memory-mapped from vectors.f32 on demand). The reverse direction is space.save("path.indx"), which seals a space back into a portable archive. Build it in CI, hand it to a teammate, mount it in a serverless function — the same code runs against it.
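Checksum validation on load can be pictured as recomputing a hash for each archive member and comparing it with a recorded value. The tutorial does not specify the hash algorithm or where the checksums live, so this sketch assumes SHA-256 digests stored under a "checksums" key in manifest.json. Both helper functions are hypothetical.

```python
import hashlib
import io
import json
import zipfile

def verify_checksums(archive):
    """Return names of members whose SHA-256 differs from the manifest."""
    bad = []
    with zipfile.ZipFile(archive) as zf:
        manifest = json.loads(zf.read("manifest.json"))
        for name, expected in manifest["checksums"].items():  # assumed key
            actual = hashlib.sha256(zf.read(name)).hexdigest()
            if actual != expected:
                bad.append(name)
    return bad

def build_archive(payload, recorded):
    """Create an in-memory ZIP whose manifest records the hash of `recorded`."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        digest = hashlib.sha256(recorded).hexdigest()
        zf.writestr("manifest.json",
                    json.dumps({"checksums": {"index.json": digest}}))
        zf.writestr("index.json", payload)
    buf.seek(0)
    return buf

original = b'{"documents": []}'
intact = verify_checksums(build_archive(original, original))
tampered = verify_checksums(build_archive(b'{"documents": [1]}', original))
print(intact, tampered)
```

An empty list means every recorded member matched; any name in the list identifies a file that changed after the archive was sealed.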

Where to next

Core concepts: The small mental model behind indx (KnowledgeSpace, Document, Chunk, Relation).

The pipeline: A stage-by-stage tour of Walk → Parse → Chunk → Relate → Enrich → Embed+Pack.

Configuration: Pin your stack in indx.toml and graduate from defaults to a reproducible setup.