Phase 3 · W25–W26

W25–W26: Evaluation (accuracy, drift, regressions)

Build a repeatable evaluation harness that catches regressions, tracks drift, and proves quality over time.

Suggested time: 4–6 hours/week

Outcomes

  • A repeatable eval run against your golden set.
  • Basic metrics (accuracy, confusion, needs_human rate).
  • A regression guard (fail if quality drops).
  • A simple drift check (distribution changes over time).
  • A short report you can read in 2 minutes.

Deliverables

  • One command to run evaluation with metrics and report output.
  • Regression thresholds that fail the script/build on quality drops.
  • Stored drift snapshots with comparison to previous runs.
  • Short report including accuracy, needs_human, confusions, and next fixes.

Prerequisites

  • W23–W24: Clustering & Recurring Issue Detection

W25–W26: Evaluation (accuracy, drift, regressions)

What you’re doing

You stop trusting “it feels good”.

In ops, “feels good” is how you ship nonsense.
Evaluation is how you:

  • prove improvement
  • catch regressions
  • detect drift
  • keep reliability over time

Time: 4–6 hours/week
Output: an evaluation harness that measures classification quality and alerts you when things get worse


The promise (what you’ll have by the end)

By the end of W26 you will have:

  • A repeatable eval run against your golden set
  • Basic metrics (accuracy, confusion, needs_human rate)
  • A regression guard (fail if quality drops)
  • A simple drift check (distribution changes over time)
  • A short report you can read in 2 minutes

The rule: if you can’t measure it, you can’t improve it

No numbers = vibes.
Vibes = failure.


What to measure (keep it simple)

1) Classification quality

Track:

  • primary label accuracy
  • confusion pairs (A mistaken as B)
  • low-confidence rate
  • needs_human rate

2) Coverage vs safety

Sometimes “needs_human” is good.
You want:

  • high accuracy when it answers
  • and a safe fallback when unsure

So track both:

  • accuracy_on_answered
  • % needs_human
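
Below is a minimal sketch of these metrics in Python, assuming each eval row is a dict with expected, predicted, and confidence keys (the field names and the 0.5 low-confidence threshold are illustrative, not part of the module). Confusion pairs are handled separately in step 3 of the checklist.

```python
# metrics.py (minimal sketch): field names and the low-confidence threshold are assumptions
def compute_metrics(rows, low_conf_threshold=0.5):
    """rows: [{"expected": "billing", "predicted": "billing", "confidence": 0.92}, ...]"""
    total = len(rows)
    if total == 0:
        return {}
    answered = [r for r in rows if r["predicted"] != "needs_human"]
    return {
        "accuracy": sum(r["predicted"] == r["expected"] for r in rows) / total,
        "accuracy_on_answered": (
            sum(r["predicted"] == r["expected"] for r in answered) / len(answered)
            if answered else 0.0
        ),
        "needs_human_rate": sum(r["predicted"] == "needs_human" for r in rows) / total,
        "low_confidence_rate": sum(r["confidence"] < low_conf_threshold for r in rows) / total,
    }
```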

3) Stability over time (drift)

Track:

  • label distribution over time
  • frequency changes in top keywords/error codes
  • changes in cluster sizes (from W23–W24)

If distribution changes, your prompts/rules may need updates.
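
For the first item, a minimal sketch that buckets tickets by ISO week, assuming each ticket carries a created_at ISO timestamp and a label (both field names are assumptions):

```python
# label_drift.py (minimal sketch): ticket fields created_at and label are assumptions
from collections import Counter, defaultdict
from datetime import datetime

def weekly_label_distribution(tickets):
    """Returns e.g. {"2025-W07": {"billing": 0.4, "login": 0.35, "needs_human": 0.25}}."""
    by_week = defaultdict(Counter)
    for t in tickets:
        week = datetime.fromisoformat(t["created_at"]).strftime("%G-W%V")
        by_week[week][t["label"]] += 1
    return {
        week: {label: count / sum(counts.values()) for label, count in counts.items()}
        for week, counts in by_week.items()
    }
```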


Step-by-step checklist

1) Freeze your golden set

Golden set must be:

  • stable
  • versioned
  • not edited casually

Put it in the repo (anonymized) or keep a checksum.
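
If you go the checksum route, a minimal sketch (the golden_set.jsonl and golden_set.sha256 file names are placeholders):

```python
# freeze_check.py (minimal sketch): file names are placeholders
import hashlib
import pathlib
import sys

def sha256_of(path):
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

if __name__ == "__main__":
    expected = pathlib.Path("golden_set.sha256").read_text().strip()
    actual = sha256_of("golden_set.jsonl")
    if actual != expected:
        sys.exit("Golden set changed (checksum mismatch). Version it deliberately, don't edit casually.")
    print("Golden set unchanged.")
```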

2) Build an eval runner

One command:

  • loads golden set
  • runs your classifier
  • compares predictions to expected labels
  • outputs metrics + report

Keep it boring and automated.
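
A minimal runner skeleton, assuming a classify(ticket_text) function from earlier weeks, a JSONL golden set with text and expected fields, and the compute_metrics sketch above (all of these names are assumptions):

```python
# run_eval.py (minimal sketch): classify(), file names, and field names are assumptions
import json
import pathlib

from my_classifier import classify   # hypothetical module from earlier weeks
from metrics import compute_metrics  # the sketch from "What to measure"

def load_golden(path="golden_set.jsonl"):
    lines = pathlib.Path(path).read_text().splitlines()
    return [json.loads(line) for line in lines if line.strip()]

def run_eval():
    rows = []
    for item in load_golden():
        result = classify(item["text"])  # assumed to return {"label": ..., "confidence": ...}
        rows.append({
            "text": item["text"],
            "expected": item["expected"],
            "predicted": result["label"],
            "confidence": result["confidence"],
        })
    metrics = compute_metrics(rows)
    pathlib.Path("eval_report.json").write_text(json.dumps({"metrics": metrics}, indent=2))
    print(json.dumps(metrics, indent=2))
    return metrics, rows

if __name__ == "__main__":
    run_eval()
```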

3) Generate a confusion summary

Show:

  • top confusion pairs
  • example tickets for each confusion

This tells you what to fix next (prompt/rules).
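
A minimal sketch over the rows produced by the runner above, collecting the most frequent confusion pairs plus a couple of example tickets each:

```python
# confusion.py (minimal sketch): row fields match the run_eval.py sketch above
from collections import Counter, defaultdict

def confusion_summary(rows, top_n=5, examples_per_pair=2):
    pairs = Counter()
    examples = defaultdict(list)
    for r in rows:
        if r["predicted"] not in (r["expected"], "needs_human"):
            pair = (r["expected"], r["predicted"])
            pairs[pair] += 1
            if len(examples[pair]) < examples_per_pair:
                examples[pair].append(r["text"])
    return [
        {"expected": exp, "predicted": pred, "count": count, "examples": examples[(exp, pred)]}
        for (exp, pred), count in pairs.most_common(top_n)
    ]
```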

4) Add a regression guard

Define thresholds like:

  • accuracy must be >= X
  • needs_human must be <= Y (or within a band)
  • invalid_json must be 0

If thresholds fail:

  • exit code non-zero
  • report says “REGRESSION”

This is how you stop shipping worse versions.
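
A minimal guard over the stored report, with placeholder thresholds you tune against your own baseline:

```python
# regression_guard.py (minimal sketch): thresholds and report path are placeholders
import json
import pathlib
import sys

MIN_ACCURACY = 0.85          # X
MAX_NEEDS_HUMAN_RATE = 0.25  # Y
MAX_INVALID_JSON = 0

def check_regression(metrics):
    failures = []
    if metrics["accuracy"] < MIN_ACCURACY:
        failures.append(f"accuracy {metrics['accuracy']:.2f} < {MIN_ACCURACY}")
    if metrics["needs_human_rate"] > MAX_NEEDS_HUMAN_RATE:
        failures.append(f"needs_human_rate {metrics['needs_human_rate']:.2f} > {MAX_NEEDS_HUMAN_RATE}")
    if metrics.get("invalid_json", 0) > MAX_INVALID_JSON:
        failures.append(f"invalid_json {metrics['invalid_json']} > {MAX_INVALID_JSON}")
    return failures

if __name__ == "__main__":
    metrics = json.loads(pathlib.Path("eval_report.json").read_text())["metrics"]
    failures = check_regression(metrics)
    if failures:
        print("REGRESSION: " + "; ".join(failures))
        sys.exit(1)
    print("OK: no regression.")
```

Run it as the last step of your eval command so the script (or CI job) fails loudly on a quality drop.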

5) Add drift checks

Store historical stats from previous runs:

  • label distribution
  • avg confidence
  • needs_human rate

If the distribution shifts sharply, flag it:

  • “Possible drift: label X doubled”

No fancy math required for v1.
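
A minimal sketch that appends one snapshot per run to a JSONL file and compares the new run against the previous one (the path, schema, and 2x factor are assumptions):

```python
# drift_check.py (minimal sketch): snapshot path, schema, and the 2x factor are assumptions
import datetime
import json
import pathlib

SNAPSHOTS = pathlib.Path("snapshots.jsonl")

def drift_warnings(current_distribution, factor=2.0):
    """Compare the current label distribution against the last stored snapshot."""
    if not SNAPSHOTS.exists():
        return []
    lines = SNAPSHOTS.read_text().splitlines()
    if not lines:
        return []
    previous = json.loads(lines[-1])
    warnings = []
    for label, share in current_distribution.items():
        old = previous["label_distribution"].get(label, 0.0)
        if old and share / old >= factor:
            warnings.append(f"Possible drift: label {label} went from {old:.0%} to {share:.0%}")
    return warnings

def append_snapshot(label_distribution, avg_confidence, needs_human_rate):
    snapshot = {
        "date": datetime.date.today().isoformat(),
        "label_distribution": label_distribution,
        "avg_confidence": avg_confidence,
        "needs_human_rate": needs_human_rate,
    }
    with SNAPSHOTS.open("a") as f:
        f.write(json.dumps(snapshot) + "\n")
    return snapshot
```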


Deliverables (you must ship these)

Deliverable A — Eval runner

  • One command to run evaluation
  • Outputs metrics + report

Deliverable B — Regression guard

  • Thresholds defined
  • Build fails (or script fails) if regression detected

Deliverable C — Drift snapshot

  • A stored snapshot of label distribution and confidence
  • A comparison to previous snapshot

Deliverable D — Short report

  • Accuracy + needs_human
  • Top confusions
  • What to fix next

Common traps (don’t do this)

  • Trap 1: “I’ll evaluate later.”
    Later means never. Eval is the foundation.

  • Trap 2: “Only accuracy matters.”
    In ops, a safe fallback matters too.

  • Trap 3: “Golden set keeps changing.”
    If the golden set changes every week, your metrics are meaningless.

Quick self-check (2 minutes)

Answer yes/no:

  • Do I have a stable golden set?
  • Can I run eval in one command?
  • Do I see confusion pairs and examples?
  • Do I have a regression threshold that fails builds?
  • Do I store drift snapshots over time?

If any “no” — fix it before moving on.


Next module: W27–W28: Operational Metrics & Reporting