
Tutorial: Your First Knowledge Space

This tutorial takes you all the way from a folder of documents to a portable, queryable knowledge space — building it, reading every line of the per-stage summary, exploring it from the CLI, then doing the same journey in Python and proving the result is portable. By the end you will understand what indx produces and how to use it.

If you have not installed indx yet, start with Installation. For the absolute fastest path, the Quickstart is a two-command tour; this page is the slower, explained version.

indx works on a directory, not a single file — the arrangement of files is exactly the signal it keeps. Create a small tree that mixes document types and folders so you can see structure, types, and relations show up later.

  • handbook/
    • policies/
      • data/
        • retention.pdf
        • access-control.md
      • security.md
    • guides/
      • onboarding.md
      • deploying.md
    • contracts/
      • 2024/
        • acme-msa.pdf
        • acme-sow.docx

The nesting matters: files in the same folder become sibling relations, folder lineage becomes parent relations, and cross-file mentions (for example, onboarding.md linking to retention.pdf) become references. Folders like contracts/2024/ carry meaning that survives into every Document and Chunk.
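The folder-derived relations can be pictured with a few lines of standard-library Python. The sketch below scaffolds the example tree with empty placeholder files and derives parent and sibling pairs from paths alone. It is an illustration under assumptions, not indx's implementation: `scaffold` and `folder_relations` are hypothetical helper names, and empty files are only useful for looking at structure (a real build needs real content to parse).

```python
from itertools import combinations
from pathlib import Path
import tempfile

# The example tree from this tutorial, as paths relative to handbook/.
FILES = [
    "policies/data/retention.pdf",
    "policies/data/access-control.md",
    "policies/security.md",
    "guides/onboarding.md",
    "guides/deploying.md",
    "contracts/2024/acme-msa.pdf",
    "contracts/2024/acme-sow.docx",
]


def scaffold(root: Path) -> None:
    """Create the example tree with empty placeholder files (hypothetical helper)."""
    for rel in FILES:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()


def folder_relations(root: Path):
    """Return (parents, siblings) derived purely from where files sit on disk.

    parents:  (file, containing folder) pairs
    siblings: unordered pairs of files that share a folder
    """
    files = sorted(p.relative_to(root) for p in root.rglob("*") if p.is_file())
    parents = [(f.as_posix(), f.parent.as_posix()) for f in files]

    by_folder: dict[str, list[str]] = {}
    for f in files:
        by_folder.setdefault(f.parent.as_posix(), []).append(f.as_posix())
    siblings = [pair for names in by_folder.values() for pair in combinations(names, 2)]
    return parents, siblings


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "handbook"
        scaffold(root)
        parents, siblings = folder_relations(root)
        print(len(parents), "parent pairs,", len(siblings), "sibling pairs")
```

For the seven files listed above this yields seven parent pairs and three sibling pairs; security.md has no sibling because it is the only file directly inside policies/.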

Run the build with defaults. The positional argument is the source directory; --out is required and receives the output layout.

# Requires the local bundle: pip install "indx[local]" (bundles bge-m3).
indx ./handbook --out ./ai-ready --llm ollama:qwen2.5 --embedder bge-m3

indx streams one progress line per stage, then a final summary. Here is a representative run (its counts are illustrative and will not match the seven-file tree above exactly):

indx ./handbook → ./ai-ready
01 walk 9 files, 6 folders
02 parse 9 ok, 0 skipped
03 chunk 84 chunks
04 relate 31 relations
05 enrich 9 documents (ollama:qwen2.5)
06 embed 84 vectors → qdrant, sealed handbook.indx
done: 84 chunks, 9 docs, embed_dim=1024 (8.7s)

| Line | What it means |
|------|---------------|
| 01 walk 9 files, 6 folders | Stage Walk traversed the tree, built the directory graph, and detected file types. Counts are your raw inventory. |
| 02 parse 9 ok, 0 skipped | Stage Parse ran each file through the parser (docling) into a ParsedDoc. skipped counts per-item failures recorded as non-fatal StageErrors; 0 means every file parsed. |
| 03 chunk 84 chunks | Stage Chunk split parsed content into retrievable Chunks, keeping lineage and continues neighbor links. |
| 04 relate 31 relations | Stage Relate resolved typed edges: sibling, parent, references, continues, duplicate-of. |
| 05 enrich 9 documents (ollama:qwen2.5) | Stage Enrich used the LLM to add detected type, topics, tags, and summaries per Document. The model in use is named so the run is auditable. |
| 06 embed 84 vectors → qdrant, sealed handbook.indx | Stage Embed+Pack vectorized chunks with bge-m3, wrote vectors to the store, and sealed the archive. |
| done: … embed_dim=1024 (8.7s) | Final totals plus the embedding dimensionality (1024 for bge-m3) and wall-clock time. |
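If you want a CI job to assert on a build, the final `done:` line is straightforward to parse. This is a minimal sketch, assuming the line always has the shape shown in the representative run; that shape is not described here as a stable contract, and `parse_done_line` is a hypothetical helper, not part of indx.

```python
import re

# Matches e.g. "done: 84 chunks, 9 docs, embed_dim=1024 (8.7s)".
# ASSUMPTION: the summary line keeps exactly this layout.
DONE_RE = re.compile(
    r"^done: (?P<chunks>\d+) chunks, (?P<docs>\d+) docs, "
    r"embed_dim=(?P<embed_dim>\d+) \((?P<seconds>\d+(?:\.\d+)?)s\)$"
)


def parse_done_line(line: str) -> dict:
    """Turn the final summary line into typed fields, or raise ValueError."""
    match = DONE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized summary line: {line!r}")
    fields = match.groupdict()
    return {
        "chunks": int(fields["chunks"]),
        "docs": int(fields["docs"]),
        "embed_dim": int(fields["embed_dim"]),
        "seconds": float(fields["seconds"]),
    }
```

A check such as `parse_done_line(last_line)["embed_dim"] == 1024` then fails loudly if someone swaps the embedder without updating downstream consumers.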

--out writes the expanded layout next to the sealed archive, so downstream tools can read either form:

  • ai-ready/
    • handbook.indx
    • index.json
    • chunks/
      • chunk_0000.json
      • chunk_0001.json
    • embeddings/
      • manifest.json
      • vectors.f32
  • handbook.indx is the single portable artifact — a ZIP container holding manifest.json, index.json, the per-chunk files, and the embeddings. This is the file you ship, diff, or hand to a teammate. See the .indx archive reference.
  • index.json is the serialized knowledge graph: the space version/root, metadata (tool version, config snapshot, embedder), stats, and the full documents, chunks, and relations. Embeddings are not inlined here. See the index.json reference.
  • chunks/ holds one JSON file per chunk in the canonical chunk shape, optionally with a resolved context window for agent consumption.
  • embeddings/ holds vectors.f32 (a contiguous little-endian float32 matrix, count × dim) plus a manifest.json pinning the embedder name and dim so a loaded archive can validate query-time compatibility.
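Because vectors.f32 is described as a contiguous little-endian float32 matrix, it can be read without indx at all. The sketch below does so with the standard library only. One detail is an assumption: the manifest is taken to store the dimensionality under a key named `dim`, and `load_vectors` is a hypothetical helper rather than an indx function.

```python
import json
import struct
from pathlib import Path


def load_vectors(embeddings_dir: Path) -> list[tuple[float, ...]]:
    """Read embeddings/vectors.f32 as rows of little-endian float32 values."""
    manifest = json.loads((embeddings_dir / "manifest.json").read_text())
    dim = manifest["dim"]  # ASSUMPTION: key name; check your manifest.json
    raw = (embeddings_dir / "vectors.f32").read_bytes()

    row_bytes = 4 * dim  # float32 = 4 bytes per component
    if dim <= 0 or len(raw) % row_bytes != 0:
        raise ValueError("vectors.f32 size does not match the manifest dim")

    row = struct.Struct(f"<{dim}f")  # "<" = little-endian, "f" = float32
    return [row.unpack_from(raw, offset) for offset in range(0, len(raw), row_bytes)]
```

The size check is the same idea as the query-time compatibility validation mentioned above: if count × dim does not divide the file evenly, the manifest and the matrix disagree. For large spaces, `numpy.memmap` with dtype `"<f4"` would avoid reading the whole file into memory.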

The expanded folder and the .indx archive describe the same knowledge space — the archive is just the sealed, movable form.
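Since the archive is a ZIP container, Python's `zipfile` module can peek inside it. The sketch below checks that the two JSON files named above are present. It assumes they sit at the archive root under exactly those names, which this page does not spell out, and `missing_members` is a hypothetical helper.

```python
import zipfile
from pathlib import Path

# ASSUMPTION: these members live at the archive root with these exact names.
REQUIRED_MEMBERS = {"manifest.json", "index.json"}


def missing_members(archive: Path) -> set[str]:
    """Return the required member names that the .indx archive lacks."""
    with zipfile.ZipFile(archive) as zf:
        return REQUIRED_MEMBERS - set(zf.namelist())
```

An empty set means both files were found; anything else is a quick signal that the file is not the archive you expected.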

indx inspect summarizes a sealed archive: space stats, a document-type histogram, and a sample of relations.

indx inspect ./ai-ready/handbook.indx

handbook.indx (indx 0.4.2)
root: /abs/path/handbook
documents: 9 chunks: 84 relations: 31
embeddings: 84 embed_dim: 1024 (bge-m3)
types
  policy 3
  guide 2
  contract 4
relations (sample)
  parent contracts/2024/acme-msa.pdf → contracts/2024
  sibling policies/data/retention.pdf → policies/data/access-control.md
  references guides/onboarding.md → policies/data/retention.pdf
  continues chunk_0007 → chunk_0008

How to read it:

  • The type histogram is space.stats.types — how many documents Enrich classified into each detected type. This is what makes --type/type= filtering possible.
  • The relations sample shows the typed graph that file-level tools throw away. parent is folder lineage, sibling is same-folder grouping, references is a resolved cross-document mention, and continues links adjacent chunks in a split sequence.

Useful flags: --json emits the full space.stats object, and --documents [type] lists documents (optionally filtered by type). See the CLI reference.
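The histogram and the type filter are two views of the same per-document label. A minimal sketch with plain dictionaries standing in for documents; the `path` and `type` keys and both helper names are assumptions for illustration, not the index.json schema.

```python
from collections import Counter


def type_histogram(documents: list[dict]) -> Counter:
    """Count documents per detected type."""
    return Counter(doc["type"] for doc in documents)


def filter_by_type(documents: list[dict], doc_type: str) -> list[str]:
    """Return the paths of documents with the given detected type."""
    return [doc["path"] for doc in documents if doc["type"] == doc_type]
```

Once every document carries a type, counting and filtering are both single passes over the document list.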

indx query embeds your text and returns the top-k matching chunks, each with its score, source, and neighbors.

indx query ./ai-ready/handbook.indx "how long is data retained?" -k 3

#1 score 0.83 policies/data/retention.pdf (policies/data · policy)
  "Enterprise data is retained for 90 days, after which it is purged…"
  neighbors: chunk_0019, chunk_0021
#2 score 0.71 policies/data/access-control.md (policies/data · policy)
  "Access to retained datasets is restricted to the data-governance role…"
  neighbors: chunk_0040, chunk_0042
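Conceptually, this retrieval step is a nearest-neighbor search over the chunk vectors. The sketch below ranks toy vectors in plain Python. Cosine similarity is an assumption here (this page only says that higher scores are better), a real run delegates the search to the vector store, and `cosine` and `top_k` are hypothetical helper names.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], vectors: dict[str, list[float]], k: int):
    """Return the k (chunk_id, score) pairs most similar to the query vector."""
    scored = sorted(
        ((cosine(query, vec), chunk_id) for chunk_id, vec in vectors.items()),
        reverse=True,
    )
    return [(chunk_id, score) for score, chunk_id in scored[:k]]
```

The query text has to be embedded with the same model as the chunks for these scores to mean anything, which is why the embedder name and dimension are pinned in the archive.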

Interpreting a SearchHit:

  • score: similarity; higher is better. Use it to judge confidence and where to cut off.
  • source path / folder: provenance from hit.source (path, folder, and detected type). This is what lets an agent cite and filter by location.
  • chunk text: hit.chunk.text, the retrievable payload that matched.
  • neighbor ids: the adjacent chunk ids. Resolve them to expand a context window around the hit instead of feeding the agent an orphaned fragment.
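Expanding a hit into a context window can be as small as the sketch below. It assumes you have a mapping from chunk id to text (for example, loaded from the chunks/ folder) and that zero-padded chunk ids sort in reading order; both are assumptions, and `context_window` is a hypothetical helper.

```python
def context_window(hit_id: str, neighbor_ids: list[str], texts: dict[str, str]) -> str:
    """Join the hit and its neighbors into one passage, in chunk-id order.

    ASSUMPTION: ids like chunk_0019 < chunk_0020 < chunk_0021 sort in reading order.
    Ids missing from `texts` are skipped rather than raising.
    """
    ordered_ids = sorted({hit_id, *neighbor_ids})
    return "\n".join(texts[cid] for cid in ordered_ids if cid in texts)
```

Passing the joined passage to an agent, rather than the single matching chunk, keeps the sentence before and after the match in view.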

Add --type policy to restrict results to one document type, or --json to emit the full SearchHit[] (including .chunk, .neighbors, and .source) for scripting.

The SDK is the CLI with handles — the KnowledgeSpace is a first-class object you can explore before you ship.

  1. Build with DirectoryPipeline. Construct with your slots (or none, for defaults) and run(src, out).

    from indx import DirectoryPipeline

    pipeline = DirectoryPipeline(
        parser="docling",
        llm="ollama:qwen2.5",
        embedder="bge-m3",
        store="qdrant",
    )
    space = pipeline.run("./handbook", "./ai-ready")
    print(space.stats.documents, space.stats.chunks, space.stats.embed_dim)

    run executes all six stages over ./handbook, writes the output layout to ./ai-ready, and returns the KnowledgeSpace.

  2. Iterate documents by type. space.documents(type=...) filters by detected type; each Document carries its path, topics, and summary.

    for doc in space.documents(type="policy"):
        print(doc.path)
        print(" topics: ", doc.topics)
        print(" summary:", doc.summary)

    policies/data/retention.pdf
     topics:  ['retention', 'compliance']
     summary: Defines the 90-day retention rule and purge process.
    policies/data/access-control.md
     topics:  ['access', 'governance']
     summary: Restricts retained-data access to the data-governance role.

  3. Search and unpack a SearchHit. Same retrieval as the CLI, returned as objects.

    for hit in space.search("how long is data retained?", k=3):
        print(hit.score, hit.source.path)
        print(hit.chunk.text)
        print("context:", [c.id for c in hit.neighbors])

    hit.source is a convenience property for hit.chunk.source. The neighbors are resolved Chunk objects, so [c.id for c in hit.neighbors] gives you the adjacent chunk ids — the building blocks of a context window.
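A common next step is turning hits into cited context for an agent prompt. The sketch below relies only on the attributes used above (`hit.score`, `hit.source.path`, `hit.chunk.text`), so it works on anything shaped like a SearchHit. The helper name `cited_context` and the 0.5 cutoff are illustrative assumptions, not indx defaults.

```python
from types import SimpleNamespace


def cited_context(hits, min_score: float = 0.5) -> str:
    """Format hits as '[path] text' lines, dropping low-confidence matches.

    ASSUMPTION: 0.5 is an arbitrary example cutoff; tune it on your own data.
    """
    lines = []
    for hit in hits:
        if hit.score < min_score:
            continue
        lines.append(f"[{hit.source.path}] {hit.chunk.text}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Stand-in object shaped like a SearchHit, so the sketch runs without indx.
    demo_hit = SimpleNamespace(
        score=0.83,
        source=SimpleNamespace(path="policies/data/retention.pdf"),
        chunk=SimpleNamespace(text="Enterprise data is retained for 90 days"),
    )
    print(cited_context([demo_hit]))
```

With a real space you would pass `space.search(...)` straight in; keeping the path in front of each passage is what lets the agent cite its source.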

The whole point of the .indx archive is that the knowledge space travels. Load the sealed archive on any machine, with no re-processing, and query it exactly as you would the freshly built space.

from indx import KnowledgeSpace

space = KnowledgeSpace.load("./ai-ready/handbook.indx")
print(space.stats.documents, "documents loaded")
for hit in space.search("gdpr compliance", k=5):
    print(hit.score, hit.source.path)

KnowledgeSpace.load verifies the archive’s version, validates checksums, and reconstructs the in-memory models (vectors are memory-mapped from vectors.f32 on demand). The reverse direction is space.save("path.indx"), which seals a space back into a portable archive. Build it in CI, hand it to a teammate, mount it in a serverless function — the same code runs against it.
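To convince yourself that a loaded archive behaves like the space that produced it, compare rankings for the same query. This sketch uses only the calls shown in this tutorial (`search(query, k=...)`, `hit.chunk.id`); `same_ranking` is a hypothetical helper, and whether scores match bit for bit across machines is not something this page promises, so it compares chunk ids only.

```python
def same_ranking(space_a, space_b, query: str, k: int = 3) -> bool:
    """True if both spaces return the same chunk ids, in the same order."""
    ids_a = [hit.chunk.id for hit in space_a.search(query, k=k)]
    ids_b = [hit.chunk.id for hit in space_b.search(query, k=k)]
    return ids_a == ids_b
```

In practice `space_a` would be the object returned by `pipeline.run(...)` and `space_b` the result of `KnowledgeSpace.load(...)` on the shipped file.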