Phase 4 · W31–W32
W31–W32: Retrieval Quality (search, ranking, relevance)
Make search and ranking measurably better so that answers with sources are actually reliable.
Suggested time: 4–6 hours/week
Outcomes
- A simple search function that returns top-k chunks reliably.
- A ranking strategy (even if basic) that beats random results.
- A small retrieval test set (questions → expected sources).
- Retrieval metrics (precision@k style, but simple).
- A “why this chunk” explanation (so debugging is possible).
Deliverables
- Retrieval function/script with top-k results and metadata filtering.
- Retrieval test set file with 20–40 questions and expected doc_id.
- Retrieval report with hit-rate numbers and top failure cases.
- Debug output showing query, chosen chunks, and selection reasons.
Prerequisites
- W29–W30: Knowledge Base Design (sources, chunking, metadata)
W31–W32: Retrieval Quality (search, ranking, relevance)
What you’re doing
You make your knowledge base *actually useful*.
RAG fails for one boring reason:
- retrieval sucks
If retrieval sucks:
- the model answers from wrong chunks
- you get confident nonsense
- and everyone stops trusting the system
So in these 2 weeks, the goal is simple:
make retrieval good enough that “answers with sources” are real.
Time: 4–6 hours/week
Output: a retrieval pipeline with a quality checklist, a small test set, and measurable improvement
The promise (what you’ll have by the end)
By the end of W32 you will have:
- A simple search function that returns top-k chunks reliably
- A ranking strategy (even if basic) that beats random results
- A small retrieval test set (questions → expected sources)
- Retrieval metrics (precision@k style, but simple)
- A “why this chunk” explanation (so debugging is possible)
The rule: don’t blame the model
If answers are wrong, 80% of the time retrieval is the problem.
So you fix these first, before you touch prompts:
- chunking
- metadata
- ranking
- filtering
Build retrieval like an engineer
1) Define your query types
Pick 3 common query types:
- “How do I do X?” (runbook/procedure)
- “Why did X happen?” (RCA/known issue)
- “What mapping/rule applies?” (mapping/DQ)
Different query types need different sources. That’s why metadata matters.
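Here is a minimal sketch of how query types could map to metadata filters. The type names and doc_type values are illustrative assumptions, reusing the fields from the examples in this plan, not a fixed schema.

```python
# Sketch only: map each query type to the metadata filters it should apply.
# "how_to", "why", "mapping" and the doc_type values are assumptions for illustration.
QUERY_TYPE_FILTERS = {
    "how_to":  {"doc_type": "runbook"},   # "How do I do X?"
    "why":     {"doc_type": "rca"},       # "Why did X happen?"
    "mapping": {"doc_type": "mapping"},   # "What mapping/rule applies?"
}

def filters_for(query_type: str) -> dict:
    """Return the metadata filters for a query type; empty dict means no filtering."""
    return QUERY_TYPE_FILTERS.get(query_type, {})
```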
2) Use metadata filtering (stop random chunks)
Examples:
- system=MDG
- object=BP
- process=interface
- doc_type=runbook
If you don’t filter, you get noise.
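A minimal top-k retrieval sketch with metadata filtering. It assumes chunks are plain dicts with doc_id, text, and a metadata field, and that you already have some scoring function; adapt the field names to your own chunk schema.

```python
from typing import Callable

def retrieve(query: str,
             chunks: list[dict],
             score_fn: Callable[[str, dict], float],
             filters: dict | None = None,
             k: int = 5) -> list[dict]:
    """Return the top-k chunks for a query, after metadata filtering.

    Assumed chunk shape (illustrative):
      {"doc_id": "...", "text": "...",
       "metadata": {"system": "MDG", "object": "BP", "doc_type": "runbook", ...}}
    score_fn is whatever scorer you have (keyword overlap, embeddings, ...).
    """
    filters = filters or {}
    # Keep only chunks whose metadata matches every filter key/value.
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in filters.items())
    ]
    # Rank the survivors by score and return the best k.
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:k]

# Example call (hypothetical scorer and filters):
# results = retrieve("how do I restart the BP interface?", chunks,
#                    score_fn=my_scorer, filters={"system": "MDG", "doc_type": "runbook"})
```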
3) Ranking basics (cheap and effective)
Start with:
- keyword match boost (titles, headings)
- recency boost (updated_at)
- doc_type boost (runbook > random notes)
- tag overlap boost
You don’t need perfect ML ranking to get big gains.
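A sketch of a cheap heuristic scorer combining the boosts above. The weights are made-up starting points, not recommendations; tune them against your test set.

```python
from datetime import date

def heuristic_score(query: str, chunk: dict) -> float:
    """Cheap ranking score: keyword, recency, doc_type, and tag-overlap boosts.
    The weights are arbitrary starting points; tune them against your test set."""
    q_terms = set(query.lower().split())
    meta = chunk["metadata"]

    # Keyword match boost: hits in the title count more than hits in the body.
    title_hits = len(q_terms & set(meta.get("title", "").lower().split()))
    body_hits = len(q_terms & set(chunk["text"].lower().split()))
    score = 3.0 * title_hits + 1.0 * body_hits

    # Recency boost: newer docs get a small bump (assumes updated_at like "2025-06-30").
    if meta.get("updated_at"):
        age_days = (date.today() - date.fromisoformat(meta["updated_at"][:10])).days
        score += max(0.0, 2.0 - age_days / 180)   # fades to zero after ~1 year

    # Doc-type boost: runbooks over random notes.
    score += {"runbook": 2.0, "rca": 1.5, "mapping": 1.5}.get(meta.get("doc_type"), 0.0)

    # Tag overlap boost.
    score += 0.5 * len(q_terms & set(meta.get("tags", [])))
    return score
```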
4) Create a retrieval test set
Make 20–40 questions.
For each question, record:
- the expected doc_id (or 2–3 acceptable docs)
This becomes your benchmark.
Without a benchmark, you're back to vibes.
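One possible file format is JSON Lines, one question per line. The field names and doc_ids below are illustrative, not a required schema.

```python
# Illustrative test set format (JSON Lines), one question per line:
#   {"question": "How do I restart the BP interface?", "expected_doc_ids": ["runbook-bp-iface-01"], "query_type": "how_to"}
#   {"question": "Why did MDG replication fail?", "expected_doc_ids": ["rca-2025-014"], "query_type": "why"}
import json

def load_test_set(path: str) -> list[dict]:
    """Load the JSONL test set into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```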
5) Measure something simple
Metrics you can use:
- “did expected doc appear in top 5?”
- “did expected doc appear in top 10?”
That’s enough for v1.
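A minimal hit@k sketch, assuming the test set format above and a retrieve function that returns ranked chunks carrying a doc_id field.

```python
def hit_at_k(test_set: list[dict], retrieve_fn, k: int = 5) -> float:
    """Fraction of questions whose expected doc appears in the top-k results.

    retrieve_fn(question, k) is assumed to return a ranked list of chunks,
    each carrying a "doc_id" field (as in the retrieval sketch above).
    """
    hits = 0
    for case in test_set:
        retrieved_ids = {chunk["doc_id"] for chunk in retrieve_fn(case["question"], k)}
        if retrieved_ids & set(case["expected_doc_ids"]):
            hits += 1
    return hits / len(test_set) if test_set else 0.0

# Example report lines:
# print(f"hit@5  = {hit_at_k(test_set, my_retrieve, k=5):.0%}")
# print(f"hit@10 = {hit_at_k(test_set, my_retrieve, k=10):.0%}")
```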
6) Debug the failures
When a question fails, ask:
- is the chunk too small / too big?
- is metadata missing?
- is the title unclear?
- do we need synonyms/tags?
Fix the docs and metadata first.
Deliverables (you must ship these)
Deliverable A — Retrieval function
- A function/script that returns top-k chunks for a query
- Includes metadata filtering support
Deliverable B — Retrieval test set
- A file with 20–40 questions
- Each has expected source doc_id
Deliverable C — Retrieval report
- top-k hit rate numbers
- top failure cases and fixes
Deliverable D — Debug tooling
- A way to print: query → top chunks → why they were picked (scores/fields)
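A sketch of that debug output, assuming the chunk shape and scorer used in the earlier sketches; swap in your own scoring and metadata fields.

```python
def explain_retrieval(query: str, chunks: list[dict], score_fn, k: int = 5) -> None:
    """Print query -> top chunks -> why they were picked (score plus key metadata).
    Assumes the chunk shape and score_fn used in the earlier sketches."""
    # Score every chunk once, then keep the top k (stable sort, score only).
    scored = sorted(((score_fn(query, c), c) for c in chunks),
                    key=lambda pair: pair[0], reverse=True)[:k]
    print(f"query: {query}")
    for rank, (score, chunk) in enumerate(scored, start=1):
        meta = chunk["metadata"]
        print(f"  #{rank}  doc_id={chunk['doc_id']}  score={score:.2f}")
        print(f"        doc_type={meta.get('doc_type')}  title={meta.get('title')!r}  "
              f"tags={meta.get('tags')}  updated_at={meta.get('updated_at')}")
```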
Common traps (don’t do this)
- Trap 1: "Let the model figure it out." No. Retrieval quality is your job.
- Trap 2: "No metadata filters." Then you're searching a junkyard.
- Trap 3: "No benchmark." Then you're not improving, you're just changing things.
Quick self-check (2 minutes)
Answer yes/no:
- Do I have a test set of questions with expected sources?
- Can I measure top-5/top-10 hit rate?
- Do I filter by metadata to reduce noise?
- Can I debug why a chunk was retrieved?
- Did I fix docs/metadata before touching prompts?
If any “no” — fix it before moving on.