Tutorial: Your First Knowledge Space
This tutorial takes you all the way from a folder of documents to a portable, queryable knowledge space — building it, reading every line of the per-stage summary, exploring it from the CLI, then doing the same journey in Python and proving the result is portable. By the end you will understand what indx produces and how to use it.
If you have not installed indx yet, start with Installation. For the absolute fastest path, the Quickstart is a two-command tour; this page is the slower, explained version.
1. Set up a sample directory
indx works on a directory, not a single file: the arrangement of files is exactly the signal it keeps. Create a small tree that mixes document types and folders so you can see structure, types, and relations show up later.
- handbook/
  - policies/
    - data/
      - retention.pdf
      - access-control.md
    - security.md
  - guides/
    - onboarding.md
    - deploying.md
  - contracts/
    - 2024/
      - acme-msa.pdf
      - acme-sow.docx
The nesting matters: files in the same folder become sibling relations, folder lineage becomes parent relations, and cross-file mentions (for example, onboarding.md linking to retention.pdf) become references. Folders like contracts/2024/ carry meaning that survives into every Document and Chunk.
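To make the structural relations concrete, here is a minimal sketch that derives parent and sibling edges from nothing but relative paths. It is an illustration of the idea, not indx's implementation: the function name `structural_relations` and the `(kind, source, target)` tuple shape are assumptions made for this example, and references are left out because they depend on file contents.

```python
from itertools import combinations
from pathlib import PurePosixPath


def structural_relations(paths):
    """Derive (kind, source, target) edges from relative file paths."""
    files = [PurePosixPath(p) for p in paths]
    relations = []

    # parent: each file points at the folder that contains it.
    for f in files:
        relations.append(("parent", str(f), str(f.parent)))

    # sibling: every pair of files that share the same folder.
    by_folder = {}
    for f in files:
        by_folder.setdefault(f.parent, []).append(f)
    for folder in sorted(by_folder):
        for a, b in combinations(sorted(by_folder[folder]), 2):
            relations.append(("sibling", str(a), str(b)))

    return relations


# A subset of the sample tree, written as paths relative to handbook/.
paths = [
    "policies/data/retention.pdf",
    "policies/data/access-control.md",
    "guides/onboarding.md",
    "guides/deploying.md",
    "contracts/2024/acme-msa.pdf",
    "contracts/2024/acme-sow.docx",
]
relations = structural_relations(paths)
for kind, src, dst in relations:
    print(f"{kind:8} {src} -> {dst}")
```

Flattening the same files into one folder would keep the parent edges but collapse the sibling groups into a single large one, which is the information a directory-aware tool preserves.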
2. Build the knowledge space
Run the build with defaults. The positional argument is the source directory; --out is required and receives the output layout.
Each command below is an alternative invocation of the same build; run one of them.

    # Requires the local bundle: pip install "indx[local]" (bundles bge-m3).
    indx ./handbook --out ./ai-ready --llm ollama:qwen2.5 --embedder bge-m3

    indx ./handbook --out ./ai-ready --llm none

    indx ./handbook --out ./ai-ready --llm none --store jsonl

indx streams one progress line per stage, then a final summary. Here is a representative run:
    indx ./handbook → ./ai-ready
      01 walk 9 files, 6 folders
      02 parse 9 ok, 0 skipped
      03 chunk 84 chunks
      04 relate 31 relations
      05 enrich 9 documents (ollama:qwen2.5)
      06 embed 84 vectors → qdrant, sealed handbook.indx
    done: 84 chunks, 9 docs, embed_dim=1024 (8.7s)

Reading each line

| Line | What it means |
|------|---------------|
| 01 walk 9 files, 6 folders | Stage Walk traversed the tree, built the directory graph, and detected file types. Counts are your raw inventory. |
| 02 parse 9 ok, 0 skipped | Stage Parse ran each file through the parser (docling) into a ParsedDoc. skipped counts per-item failures recorded as non-fatal StageErrors — 0 means every file parsed. |
| 03 chunk 84 chunks | Stage Chunk split parsed content into retrievable Chunks, keeping lineage and continues neighbor links. |
| 04 relate 31 relations | Stage Relate resolved typed edges: sibling, parent, references, continues, duplicate-of. |
| 05 enrich 9 documents (ollama:qwen2.5) | Stage Enrich used the LLM to add detected type, topics, tags, and summaries per Document. The model in use is named so the run is auditable. |
| 06 embed 84 vectors → qdrant, sealed handbook.indx | Stage Embed+Pack vectorized chunks with bge-m3, wrote vectors to the store, and sealed the archive. |
| done: … embed_dim=1024 (8.7s) | Final totals plus the embedding dimensionality (1024 for bge-m3) and wall-clock time. |
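
If you want to assert on a build from a script (for example, failing CI when the chunk count drops to zero), the final summary line can be parsed with a regular expression. This is a sketch under an assumption: the line format is copied from the representative run above and is not documented here as a stable interface, so treat the pattern and the `parse_done_line` helper as hypothetical.

```python
import re

# Pattern modeled on the sample line:
#   done: 84 chunks, 9 docs, embed_dim=1024 (8.7s)
DONE_LINE = re.compile(
    r"done:\s*(?P<chunks>\d+) chunks,\s*(?P<docs>\d+) docs,\s*"
    r"embed_dim=(?P<embed_dim>\d+)\s*\((?P<seconds>[\d.]+)s\)"
)


def parse_done_line(line):
    """Extract the totals from a summary line, or raise if it does not match."""
    match = DONE_LINE.search(line)
    if match is None:
        raise ValueError(f"not a summary line: {line!r}")
    return {
        "chunks": int(match["chunks"]),
        "docs": int(match["docs"]),
        "embed_dim": int(match["embed_dim"]),
        "seconds": float(match["seconds"]),
    }


summary = parse_done_line("done: 84 chunks, 9 docs, embed_dim=1024 (8.7s)")
print(summary)
```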
3. Look at the output layout
--out writes the expanded layout next to the sealed archive, so downstream tools can read either form:
- ai-ready/
  - handbook.indx
  - index.json
  - chunks/
    - chunk_0000.json
    - chunk_0001.json
    - …
  - embeddings/
    - manifest.json
    - vectors.f32
- handbook.indx is the single portable artifact: a ZIP container holding manifest.json, index.json, the per-chunk files, and the embeddings. This is the file you ship, diff, or hand to a teammate. See the .indx archive reference.
- index.json is the serialized knowledge graph: the space version/root, metadata (tool version, config snapshot, embedder), stats, and the full documents, chunks, and relations. Embeddings are not inlined here. See the index.json reference.
- chunks/ holds one JSON file per chunk in the canonical chunk shape, optionally with a resolved context window for agent consumption.
- embeddings/ holds vectors.f32 (a contiguous little-endian float32 matrix, count × dim) plus a manifest.json pinning the embedder name and dim so a loaded archive can validate query-time compatibility.
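Because vectors.f32 is described as a plain little-endian float32 matrix, it can be read without any indx code. The sketch below uses only the standard library and runs against a tiny synthetic file it writes itself. The manifest key names `embedder` and `dim` are assumptions based on the description above; check the real manifest.json before relying on them.

```python
import json
import struct
import tempfile
from pathlib import Path


def load_vectors(embeddings_dir):
    """Read vectors.f32 as a list of rows, using dim from manifest.json."""
    embeddings_dir = Path(embeddings_dir)
    manifest = json.loads((embeddings_dir / "manifest.json").read_text())
    dim = manifest["dim"]  # assumed key name
    raw = (embeddings_dir / "vectors.f32").read_bytes()

    row_bytes = dim * 4  # one float32 is 4 bytes
    if len(raw) % row_bytes != 0:
        raise ValueError("vectors.f32 size is not a multiple of dim * 4 bytes")
    count = len(raw) // row_bytes

    # "<" selects little-endian; "f" is a 4-byte float.
    flat = struct.unpack(f"<{count * dim}f", raw)
    rows = [flat[i * dim:(i + 1) * dim] for i in range(count)]
    return manifest, rows


# Demo on synthetic data: two vectors of dimension 3.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "manifest.json").write_text(json.dumps({"embedder": "demo", "dim": 3}))
    (demo / "vectors.f32").write_bytes(struct.pack("<6f", 1.0, 0.0, 0.5, 0.25, 2.0, -1.0))
    manifest, vectors = load_vectors(demo)

print(manifest["embedder"], len(vectors), vectors[1])
```

The size check is the same kind of guard the manifest enables at query time: if the byte length does not divide evenly by dim, the file and the manifest disagree.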
The expanded folder and the .indx archive describe the same knowledge space — the archive is just the sealed, movable form.
4. Inspect the archive
indx inspect summarizes a sealed archive: space stats, a document-type histogram, and a sample of relations.
    indx inspect ./ai-ready/handbook.indx

    handbook.indx (indx 0.4.2)
      root: /abs/path/handbook
      documents: 9
      chunks: 84
      relations: 31
      embeddings: 84
      embed_dim: 1024 (bge-m3)

    types
      policy 3
      guide 2
      contract 4

    relations (sample)
      parent contracts/2024/acme-msa.pdf → contracts/2024
      sibling policies/data/retention.pdf → policies/data/access-control.md
      references guides/onboarding.md → policies/data/retention.pdf
      continues chunk_0007 → chunk_0008

How to read it:
- The type histogram is space.stats.types: how many documents Enrich classified into each detected type. This is what makes --type / type= filtering possible.
- The relations sample shows the typed graph that file-level tools throw away. parent is folder lineage, sibling is same-folder grouping, references is a resolved cross-document mention, and continues links adjacent chunks in a split sequence.
Useful flags: --json emits the full space.stats object, and --documents [type] lists documents (optionally filtered by type). See the CLI reference.
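Conceptually, the type histogram is just a count of documents grouped by detected type, and type filtering is a selection over the same field. The sketch below shows that relationship with plain dictionaries. The records are illustrative stand-ins, not the actual index.json document shape, and the type labels on the guide and contract entries are assumed for the example.

```python
from collections import Counter

# Illustrative stand-ins for enriched documents (shape assumed).
documents = [
    {"path": "policies/data/retention.pdf", "type": "policy"},
    {"path": "policies/data/access-control.md", "type": "policy"},
    {"path": "guides/onboarding.md", "type": "guide"},
    {"path": "contracts/2024/acme-msa.pdf", "type": "contract"},
]

# Histogram: how many documents fall under each detected type.
type_histogram = Counter(doc["type"] for doc in documents)

# Filtering by type selects on the same field the histogram counts.
policies = [doc["path"] for doc in documents if doc["type"] == "policy"]

print(dict(type_histogram))
print(policies)
```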
5. Query the archive
indx query embeds your text and returns the top-k matching chunks, each with its score, source, and neighbors.
    indx query ./ai-ready/handbook.indx "how long is data retained?" -k 3

    #1 score 0.83 policies/data/retention.pdf (policies/data · policy)
       "Enterprise data is retained for 90 days, after which it is purged…"
       neighbors: chunk_0019, chunk_0021

    #2 score 0.71 policies/data/access-control.md (policies/data · policy)
       "Access to retained datasets is restricted to the data-governance role…"
       neighbors: chunk_0040, chunk_0042

Interpreting a SearchHit:
- score: similarity; higher is better. Use it to judge confidence and where to cut off.
- source path / folder: provenance from hit.source (path, folder, and detected type). This is what lets an agent cite and filter by location.
- chunk text: hit.chunk.text, the retrievable payload that matched.
- neighbor ids: the adjacent chunk ids. Resolve them to expand a context window around the hit instead of feeding the agent an orphaned fragment.
Add --type policy to restrict results to one document type, or --json to emit the full SearchHit[] (including .chunk, .neighbors, and .source) for scripting.
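For scripting, one plausible use of the --json output is applying a score cutoff before passing hits downstream. The sketch below works on a hand-written stand-in for that output. The exact JSON layout is an assumption (only the presence of chunk, neighbors, and source is stated above), and `confident_hits` is a hypothetical helper, so adjust the field access to whatever your version actually emits.

```python
import json

# Stand-in for the stdout of `indx query ... --json`; field layout is assumed.
raw = json.dumps([
    {
        "score": 0.83,
        "source": {"path": "policies/data/retention.pdf",
                   "folder": "policies/data", "type": "policy"},
        "chunk": {"text": "Enterprise data is retained for 90 days, after which it is purged…"},
        "neighbors": ["chunk_0019", "chunk_0021"],
    },
    {
        "score": 0.71,
        "source": {"path": "policies/data/access-control.md",
                   "folder": "policies/data", "type": "policy"},
        "chunk": {"text": "Access to retained datasets is restricted to the data-governance role…"},
        "neighbors": ["chunk_0040", "chunk_0042"],
    },
])


def confident_hits(hits, min_score):
    """Keep only hits whose similarity score reaches the cutoff."""
    return [hit for hit in hits if hit["score"] >= min_score]


hits = json.loads(raw)
kept = confident_hits(hits, min_score=0.75)
for hit in kept:
    print(hit["score"], hit["source"]["path"])
```

The right cutoff depends on the embedder and the corpus; 0.75 here is an arbitrary value chosen to separate the two sample hits.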
6. The same journey in Python
The SDK is the CLI with handles: the KnowledgeSpace is a first-class object you can explore before you ship.
1. Build with DirectoryPipeline. Construct with your slots (or none, for defaults) and run(src, out).

       from indx import DirectoryPipeline

       pipeline = DirectoryPipeline(
           parser="docling",
           llm="ollama:qwen2.5",
           embedder="bge-m3",
           store="qdrant",
       )
       space = pipeline.run("./handbook", "./ai-ready")
       print(space.stats.documents, space.stats.chunks, space.stats.embed_dim)

   run executes all six stages over ./handbook, writes the output layout to ./ai-ready, and returns the KnowledgeSpace.

2. Iterate documents by type. space.documents(type=...) filters by detected type; each Document carries its path, topics, and summary.

       for doc in space.documents(type="policy"):
           print(doc.path)
           print(" topics: ", doc.topics)
           print(" summary:", doc.summary)

   Output:

       policies/data/retention.pdf
        topics: ['retention', 'compliance']
        summary: Defines the 90-day retention rule and purge process.
       policies/data/access-control.md
        topics: ['access', 'governance']
        summary: Restricts retained-data access to the data-governance role.

3. Search and unpack a SearchHit. Same retrieval as the CLI, returned as objects.

       for hit in space.search("how long is data retained?", k=3):
           print(hit.score, hit.source.path)
           print(hit.chunk.text)
           print("context:", [c.id for c in hit.neighbors])

   hit.source is a convenience property for hit.chunk.source. The neighbors are resolved Chunk objects, so [c.id for c in hit.neighbors] gives you the adjacent chunk ids, the building blocks of a context window.
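One way to turn a hit and its neighbors into a context window is to merge them and restore reading order. The sketch below uses small stand-in dataclasses so it runs without indx installed; with the SDK you would pass a real hit instead. It assumes that zero-padded chunk ids sort in document order, which the sample ids suggest but this page does not guarantee, and the chunk texts are made up for the demo.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:  # stand-in exposing only the attributes used here
    id: str
    text: str


@dataclass
class Hit:  # stand-in mirroring hit.chunk and hit.neighbors
    score: float
    chunk: Chunk
    neighbors: list = field(default_factory=list)


def context_window(hit):
    """Join the matched chunk and its neighbors in id order."""
    # Keying by id de-duplicates in case a neighbor repeats the hit itself.
    chunks = {c.id: c for c in [hit.chunk, *hit.neighbors]}
    return "\n".join(chunks[cid].text for cid in sorted(chunks))


# Demo data: ids follow the chunk_NNNN pattern; the texts are invented.
hit = Hit(
    score=0.83,
    chunk=Chunk("chunk_0020", "Enterprise data is retained for 90 days."),
    neighbors=[
        Chunk("chunk_0021", "Purge jobs run nightly."),
        Chunk("chunk_0019", "Scope: all enterprise datasets."),
    ],
)
window = context_window(hit)
print(window)
```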
7. Prove it is portable
The whole point of the .indx archive is that the knowledge space travels. Load the sealed archive on any machine, with no re-processing, and query it again.
    from indx import KnowledgeSpace

    space = KnowledgeSpace.load("./ai-ready/handbook.indx")
    print(space.stats.documents, "documents loaded")
    for hit in space.search("gdpr compliance", k=5):
        print(hit.score, hit.source.path)

KnowledgeSpace.load verifies the archive’s version, validates checksums, and reconstructs the in-memory models (vectors are memory-mapped from vectors.f32 on demand). The reverse direction is space.save("path.indx"), which seals a space back into a portable archive. Build it in CI, hand it to a teammate, mount it in a serverless function: the same code runs against it.
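To illustrate what checksum validation on a ZIP-based archive can look like, here is a self-contained sketch that builds a tiny in-memory archive and verifies each member against a recorded SHA-256 digest. Everything about the layout is hypothetical: the `checksums.json` member name, the digest algorithm, and the helper names are assumptions for this example, and the real .indx format may record checksums differently.

```python
import hashlib
import io
import json
import zipfile

CHECKSUMS = "checksums.json"  # hypothetical member name, not from the .indx spec


def build_demo_archive(tamper=False):
    """Create an in-memory ZIP with two members and a digest table."""
    members = {"index.json": b'{"documents": []}', "manifest.json": b'{"dim": 3}'}
    sums = {name: hashlib.sha256(data).hexdigest() for name, data in members.items()}
    if tamper:
        sums["index.json"] = "0" * 64  # simulate a digest mismatch
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in members.items():
            archive.writestr(name, data)
        archive.writestr(CHECKSUMS, json.dumps(sums))
    return buffer.getvalue()


def verify(archive_bytes):
    """Return the names of members whose SHA-256 does not match the table."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        expected = json.loads(archive.read(CHECKSUMS))
        return [
            name
            for name, digest in expected.items()
            if hashlib.sha256(archive.read(name)).hexdigest() != digest
        ]


bad = verify(build_demo_archive())
print("corrupt members:", bad)
```

A loader that runs a check like this before reconstructing models can refuse a damaged archive up front instead of failing partway through a query.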