Phase 4 · W31–W32

W31–W32: Retrieval Quality (search, ranking, relevance)

Improve retrieval quality with measurable search and ranking so answers with sources are actually reliable.

Suggested time: 4–6 hours/week

Outcomes

  • A simple search function that returns top-k chunks reliably.
  • A ranking strategy (even if basic) that beats random results.
  • A small retrieval test set (questions → expected sources).
  • Retrieval metrics (precision@k style, but simple).
  • A “why this chunk” explanation (so debugging is possible).

Deliverables

  • Retrieval function/script with top-k results and metadata filtering.
  • Retrieval test set file with 20–40 questions and expected doc_id.
  • Retrieval report with hit-rate numbers and top failure cases.
  • Debug output showing query, chosen chunks, and selection reasons.

Prerequisites

  • W29–W30: Knowledge Base Design (sources, chunking, metadata)

W31–W32: Retrieval Quality (search, ranking, relevance)

What you’re doing

You make your knowledge base *actually useful*.

RAG fails for one boring reason: retrieval sucks.

If retrieval sucks:

  • the model answers from wrong chunks
  • you get confident nonsense
  • and everyone stops trusting the system

So in these 2 weeks, the goal is simple:
make retrieval good enough that “answers with sources” are real.

Time: 4–6 hours/week
Output: a retrieval pipeline with a quality checklist, a small test set, and measurable improvement


The promise (what you’ll have by the end)

By the end of W32 you will have:

  • A simple search function that returns top-k chunks reliably
  • A ranking strategy (even if basic) that beats random results
  • A small retrieval test set (questions → expected sources)
  • Retrieval metrics (precision@k style, but simple)
  • A “why this chunk” explanation (so debugging is possible)

The rule: don’t blame the model

If answers are wrong, 80% of the time retrieval is the problem.

So, before you touch prompts, you fix:

  • chunking
  • metadata
  • ranking
  • filtering

Build retrieval like an engineer

1) Define your query types

Pick 3 common query types:

  • “How do I do X?” (runbook/procedure)
  • “Why did X happen?” (RCA/known issue)
  • “What mapping/rule applies?” (mapping/DQ)

Different query types need different sources. That’s why metadata matters.
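
For example, a tiny routing table from query type to default filters can make this concrete. This is only a sketch; the doc_type values and field names are assumptions based on the W29–W30 metadata design, not a fixed schema.

    # Sketch: route each query type to default metadata filters.
    # The doc_type values are illustrative; use whatever your
    # W29–W30 metadata schema actually defines.
    QUERY_TYPE_FILTERS = {
        "how_to":  {"doc_type": "runbook"},   # "How do I do X?"
        "why":     {"doc_type": "rca"},       # "Why did X happen?"
        "mapping": {"doc_type": "mapping"},   # "What mapping/rule applies?"
    }

    def filters_for(query_type: str) -> dict:
        return QUERY_TYPE_FILTERS.get(query_type, {})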

2) Use metadata filtering (stop random chunks)

Examples:

  • system=MDG
  • object=BP
  • process=interface
  • doc_type=runbook

If you don’t filter, you get noise.
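
A minimal filtering sketch, assuming chunks shaped like the W29–W30 output (a dict with a "metadata" field); adapt the field names to your own schema.

    # Minimal sketch of metadata filtering before any ranking happens.
    # Chunk shape and field names are assumptions; adapt to your schema.
    def filter_chunks(chunks: list[dict], filters: dict) -> list[dict]:
        """Keep only chunks whose metadata matches every filter key/value."""
        return [
            c for c in chunks
            if all(c.get("metadata", {}).get(k) == v for k, v in filters.items())
        ]

    chunks = [
        {"text": "Restart the BP interface ...",
         "metadata": {"system": "MDG", "object": "BP", "doc_type": "runbook"}},
        {"text": "Random meeting notes ...",
         "metadata": {"system": "MDG", "object": "BP", "doc_type": "notes"}},
    ]
    # Only the runbook chunk survives:
    print(filter_chunks(chunks, {"system": "MDG", "doc_type": "runbook"}))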

3) Ranking basics (cheap and effective)

Start with:

  • keyword match boost (titles, headings)
  • recency boost (updated_at)
  • doc_type boost (runbook > random notes)
  • tag overlap boost

You don’t need perfect ML ranking to get big gains.
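
A cheap additive scorer on top of your base similarity score could look like the sketch below. The boost weights and field names are arbitrary starting points to tune against your test set, not recommended values.

    from datetime import date

    # Sketch of a cheap additive re-ranking score on top of a base score.
    # Weights and metadata fields are assumptions; tune them on your test set.
    DOC_TYPE_BOOST = {"runbook": 2.0, "rca": 1.5, "mapping": 1.5, "notes": 0.0}

    def rank_score(chunk: dict, query_terms: set, query_tags: set, base_score: float) -> float:
        meta = chunk.get("metadata", {})
        score = base_score
        # Keyword match boost: query terms that appear in the title/heading.
        title_terms = set(meta.get("title", "").lower().split())
        score += 1.0 * len(query_terms & title_terms)
        # Recency boost: decays to zero after roughly a year (ISO date in updated_at).
        age_days = (date.today() - date.fromisoformat(meta.get("updated_at", "1970-01-01"))).days
        score += max(0.0, 1.0 - age_days / 365)
        # Doc type boost: runbooks beat random notes.
        score += DOC_TYPE_BOOST.get(meta.get("doc_type"), 0.0)
        # Tag overlap boost.
        score += 0.5 * len(query_tags & set(meta.get("tags", [])))
        return score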

4) Create a retrieval test set

Make 20–40 questions.
For each question:

  • expected doc_id (or 2–3 acceptable docs)

This becomes your benchmark. Without a benchmark, you will do vibes again.
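
One possible shape for that file is JSON Lines, plus a tiny loader. The file name, field names, and doc_ids below are illustrative, not a required format.

    import json

    # Sketch of the test set as JSON Lines, one case per line, e.g.:
    #   {"question": "How do I restart the BP interface?", "expected_doc_ids": ["runbook-bp-if-001"]}
    #   {"question": "Why did BP replication fail?", "expected_doc_ids": ["rca-bp-repl-017"]}
    # File name and doc_ids are illustrative.
    def load_testset(path: str = "retrieval_testset.jsonl") -> list[dict]:
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]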

5) Measure something simple

Metrics you can use:

  • “did expected doc appear in top 5?”
  • “did expected doc appear in top 10?”

That’s enough for v1.
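
A minimal hit@k check over that test set might look like this. It assumes a search(question, top_k) function (your Deliverable A) whose results carry a doc_id in their metadata; adjust the result shape to match yours.

    # Sketch of a top-k hit-rate check. `search` is your retrieval function
    # (Deliverable A); the result shape is an assumption, adapt as needed.
    def hit_rate(testset: list[dict], search, k: int = 5) -> float:
        hits = 0
        for case in testset:
            results = search(case["question"], top_k=k)
            found = {r["metadata"]["doc_id"] for r in results}
            if found & set(case["expected_doc_ids"]):
                hits += 1
        return hits / len(testset)

    # For the retrieval report (Deliverable C):
    #   print(f"hit@5:  {hit_rate(testset, search, k=5):.0%}")
    #   print(f"hit@10: {hit_rate(testset, search, k=10):.0%}")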

6) Debug the failures

When a question fails:

  • is chunk too small / too big?
  • is metadata missing?
  • is title unclear?
  • do we need synonyms/tags?

Fix the docs and metadata first.
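
To make that debugging possible, a minimal "why this chunk" printer (Deliverable D) can be very simple. The score and score_breakdown fields are assumptions about what your ranker exposes; adapt them to your own scoring code.

    # Sketch of debug output: query → top chunks → why they were picked.
    # Assumes each result carries its score and a score_breakdown dict
    # from the ranking step; adapt to whatever your ranker exposes.
    def explain(query: str, results: list) -> None:
        print(f"query: {query}")
        for rank, r in enumerate(results, start=1):
            meta = r.get("metadata", {})
            reasons = r.get("score_breakdown", {})
            print(f"  {rank}. doc_id={meta.get('doc_id')} doc_type={meta.get('doc_type')} "
                  f"score={r.get('score', 0):.2f} reasons={reasons}")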

Deliverables (you must ship these)

Deliverable A — Retrieval function

  • A function/script that returns top-k chunks for a query
  • Includes metadata filtering support

Deliverable B — Retrieval test set

  • A file with 20–40 questions
  • Each has expected source doc_id

Deliverable C — Retrieval report

  • top-k hit rate numbers
  • top failure cases and fixes

Deliverable D — Debug tooling

  • A way to print: query → top chunks → why they were picked (scores/fields)

Common traps (don’t do this)

  • Trap 1: “Let the model figure it out.”

No. Retrieval quality is your job.

  • Trap 2: “No metadata filters.”

Then you’re searching a junkyard.

  • Trap 3: “No benchmark.”

Then you’re not improving; you’re just changing.

Quick self-check (2 minutes)

Answer yes/no:

  • Do I have a test set of questions with expected sources?
  • Can I measure top-5/top-10 hit rate?
  • Do I filter by metadata to reduce noise?
  • Can I debug why a chunk was retrieved?
  • Did I fix docs/metadata before touching prompts?

If any “no” — fix it before moving on.


Next module: W33–W34: Guardrails & Governance (what is allowed, what is not)