Phase 4 · W31–W32

W31–W32: Retrieval Quality (search, ranking, relevance)

Improve retrieval quality with measurable search and ranking so answers with sources are actually reliable.

Suggested time: 4–6 hours/week

Outcomes

  • A simple search function that returns top-k chunks reliably.
  • A ranking strategy (even if basic) that beats random results.
  • A small retrieval test set (questions → expected sources).
  • Retrieval metrics (precision@k style, but simple).
  • A “why this chunk” explanation (so debugging is possible).

Deliverables

  • Retrieval function/script with top-k results and metadata filtering.
  • Retrieval test set file with 20–40 questions and expected doc_id.
  • Retrieval report with hit-rate numbers and top failure cases.
  • Debug output showing query, chosen chunks, and selection reasons.

Prerequisites

  • W29–W30: Knowledge Base Design (sources, chunking, metadata)

W31–W32: Retrieval Quality (search, ranking, relevance)

What you’re doing

You make your knowledge base *actually useful*.

RAG fails for one boring reason: retrieval sucks.

If retrieval sucks:

  • the model answers from wrong chunks
  • you get confident nonsense
  • and everyone stops trusting the system

So in these 2 weeks, the goal is simple:
make retrieval good enough that “answers with sources” are real.

Time: 4–6 hours/week
Output: a retrieval pipeline with a quality checklist, a small test set, and measurable improvement


The promise (what you’ll have by the end)

By the end of W32 you will have:

  • A simple search function that returns top-k chunks reliably
  • A ranking strategy (even if basic) that beats random results
  • A small retrieval test set (questions → expected sources)
  • Retrieval metrics (precision@k style, but simple)
  • A “why this chunk” explanation (so debugging is possible)

The rule: don’t blame the model

If answers are wrong, 80% of the time retrieval is the problem.

So, before you touch prompts, you fix:

  • chunking
  • metadata
  • ranking
  • filtering

Build retrieval like an engineer

1) Define your query types

Pick 3 common query types:

  • “How do I do X?” (runbook/procedure)
  • “Why did X happen?” (RCA/known issue)
  • “What mapping/rule applies?” (mapping/DQ)

Different query types need different sources. That’s why metadata matters.
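
For example, a tiny routing table from query type to default filters can make this concrete. This is only a sketch; the doc_type values and field names are assumptions based on the W29–W30 metadata design, not a fixed schema.

    # Sketch: route each query type to default metadata filters.
    # The doc_type values are illustrative; use whatever your
    # W29–W30 metadata schema actually defines.
    QUERY_TYPE_FILTERS = {
        "how_to":  {"doc_type": "runbook"},   # "How do I do X?"
        "why":     {"doc_type": "rca"},       # "Why did X happen?"
        "mapping": {"doc_type": "mapping"},   # "What mapping/rule applies?"
    }

    def filters_for(query_type: str) -> dict:
        return QUERY_TYPE_FILTERS.get(query_type, {})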

2) Use metadata filtering (stop random chunks)

Examples:

  • system=MDG
  • object=BP
  • process=interface
  • doc_type=runbook

If you don’t filter, you get noise.
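
A minimal filtering sketch, assuming chunks shaped like the W29–W30 output (a dict with a "metadata" field); adapt the field names to your own schema.

    # Minimal sketch of metadata filtering before any ranking happens.
    # Chunk shape and field names are assumptions; adapt to your schema.
    def filter_chunks(chunks: list[dict], filters: dict) -> list[dict]:
        """Keep only chunks whose metadata matches every filter key/value."""
        return [
            c for c in chunks
            if all(c.get("metadata", {}).get(k) == v for k, v in filters.items())
        ]

    chunks = [
        {"text": "Restart the BP interface ...",
         "metadata": {"system": "MDG", "object": "BP", "doc_type": "runbook"}},
        {"text": "Random meeting notes ...",
         "metadata": {"system": "MDG", "object": "BP", "doc_type": "notes"}},
    ]
    # Only the runbook chunk survives:
    print(filter_chunks(chunks, {"system": "MDG", "doc_type": "runbook"}))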

3) Ranking basics (cheap and effective)

Start with:

  • keyword match boost (titles, headings)
  • recency boost (updated_at)
  • doc_type boost (runbook > random notes)
  • tag overlap boost

You don’t need perfect ML ranking to get big gains.
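
A cheap additive scorer on top of your base similarity score could look like the sketch below. The boost weights and field names are arbitrary starting points to tune against your test set, not recommended values.

    from datetime import date

    # Sketch of a cheap additive re-ranking score on top of a base score.
    # Weights and metadata fields are assumptions; tune them on your test set.
    DOC_TYPE_BOOST = {"runbook": 2.0, "rca": 1.5, "mapping": 1.5, "notes": 0.0}

    def rank_score(chunk: dict, query_terms: set, query_tags: set, base_score: float) -> float:
        meta = chunk.get("metadata", {})
        score = base_score
        # Keyword match boost: query terms that appear in the title/heading.
        title_terms = set(meta.get("title", "").lower().split())
        score += 1.0 * len(query_terms & title_terms)
        # Recency boost: decays to zero after roughly a year (ISO date in updated_at).
        age_days = (date.today() - date.fromisoformat(meta.get("updated_at", "1970-01-01"))).days
        score += max(0.0, 1.0 - age_days / 365)
        # Doc type boost: runbooks beat random notes.
        score += DOC_TYPE_BOOST.get(meta.get("doc_type"), 0.0)
        # Tag overlap boost.
        score += 0.5 * len(query_tags & set(meta.get("tags", [])))
        return score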

4) Create a retrieval test set

Make 20–40 questions.
For each question:

  • expected doc_id (or 2–3 acceptable docs)

This becomes your benchmark. Without a benchmark, you will do vibes again.
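
One possible shape for that file is JSON Lines, plus a tiny loader. The file name, field names, and doc_ids below are illustrative, not a required format.

    import json

    # Sketch of the test set as JSON Lines, one case per line, e.g.:
    #   {"question": "How do I restart the BP interface?", "expected_doc_ids": ["runbook-bp-if-001"]}
    #   {"question": "Why did BP replication fail?", "expected_doc_ids": ["rca-bp-repl-017"]}
    # File name and doc_ids are illustrative.
    def load_testset(path: str = "retrieval_testset.jsonl") -> list[dict]:
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]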

5) Measure something simple

Metrics you can use:

  • “did expected doc appear in top 5?”
  • “did expected doc appear in top 10?”

That’s enough for v1.
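
A minimal hit@k check over that test set might look like this. It assumes a search(question, top_k) function (your Deliverable A) whose results carry a doc_id in their metadata; adjust the result shape to match yours.

    # Sketch of a top-k hit-rate check. `search` is your retrieval function
    # (Deliverable A); the result shape is an assumption, adapt as needed.
    def hit_rate(testset: list[dict], search, k: int = 5) -> float:
        hits = 0
        for case in testset:
            results = search(case["question"], top_k=k)
            found = {r["metadata"]["doc_id"] for r in results}
            if found & set(case["expected_doc_ids"]):
                hits += 1
        return hits / len(testset)

    # For the retrieval report (Deliverable C):
    #   print(f"hit@5:  {hit_rate(testset, search, k=5):.0%}")
    #   print(f"hit@10: {hit_rate(testset, search, k=10):.0%}")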

6) Debug the failures

When a question fails:

  • is chunk too small / too big?
  • is metadata missing?
  • is title unclear?
  • do we need synonyms/tags?

Fix the docs and metadata first.
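
To make that debugging possible, a minimal "why this chunk" printer (Deliverable D) can be very simple. The score and score_breakdown fields are assumptions about what your ranker exposes; adapt them to your own scoring code.

    # Sketch of debug output: query → top chunks → why they were picked.
    # Assumes each result carries its score and a score_breakdown dict
    # from the ranking step; adapt to whatever your ranker exposes.
    def explain(query: str, results: list) -> None:
        print(f"query: {query}")
        for rank, r in enumerate(results, start=1):
            meta = r.get("metadata", {})
            reasons = r.get("score_breakdown", {})
            print(f"  {rank}. doc_id={meta.get('doc_id')} doc_type={meta.get('doc_type')} "
                  f"score={r.get('score', 0):.2f} reasons={reasons}")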

Deliverables (you must ship these)

Deliverable A — Retrieval function

  • A function/script that returns top-k chunks for a query
  • Includes metadata filtering support

Deliverable B — Retrieval test set

  • A file with 20–40 questions
  • Each has expected source doc_id

Deliverable C — Retrieval report

  • top-k hit rate numbers
  • top failure cases and fixes

Deliverable D — Debug tooling

  • A way to print: query → top chunks → why they were picked (scores/fields)

Common traps (don’t do this)

  • Trap 1: “Let the model figure it out.”

No. Retrieval quality is your job.

  • Trap 2: “No metadata filters.”

Then you’re searching a junkyard.

  • Trap 3: “No benchmark.”

Then you’re not improving; you’re just changing.

Quick self-check (2 minutes)

Answer yes/no:

  • Do I have a test set of questions with expected sources?
  • Can I measure top-5/top-10 hit rate?
  • Do I filter by metadata to reduce noise?
  • Can I debug why a chunk was retrieved?
  • Did I fix docs/metadata before touching prompts?

If any “no” — fix it before moving on.


Next module: W33–W34: Guardrails & Governance (what is allowed, what is not)