Phase 3 · W25–W26
W25–W26: Evaluation (accuracy, drift, regressions)
Build a repeatable evaluation harness that catches regressions, tracks drift, and proves quality over time.
Suggested time: 4–6 hours/week
Outcomes
- A repeatable eval run against your golden set.
- Basic metrics (accuracy, confusion, needs_human rate).
- A regression guard (fail if quality drops).
- A simple drift check (distribution changes over time).
- A short report you can read in 2 minutes.
Deliverables
- One command to run evaluation with metrics and report output.
- Regression thresholds that fail the script/build on quality drops.
- Stored drift snapshots with comparison to previous runs.
- Short report including accuracy, needs_human, confusions, and next fixes.
Prerequisites
- W23–W24: Clustering & Recurring Issue Detection
What you’re doing
You stop trusting “it feels good”.
In ops, “feels good” is how you ship nonsense.
Evaluation is how you:
- prove improvement
- catch regressions
- detect drift
- keep reliability over time
Time: 4–6 hours/week
Output: an evaluation harness that measures classification quality and alerts you when things get worse
The promise (what you’ll have by the end)
By the end of W26 you will have:
- A repeatable eval run against your golden set
- Basic metrics (accuracy, confusion, needs_human rate)
- A regression guard (fail if quality drops)
- A simple drift check (distribution changes over time)
- A short report you can read in 2 minutes
The rule: if you can’t measure it, you can’t improve it
No numbers = vibes.
Vibes = failure.
What to measure (keep it simple)
1) Classification quality
Track:
- primary label accuracy
- confusion pairs (A mistaken as B)
- low-confidence rate
- needs_human rate
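All four are cheap to compute once you store predictions next to expected labels. A minimal sketch, assuming each eval record is a dict like {"expected": ..., "predicted": ..., "confidence": ...} (field names are illustrative, not a fixed schema):

```python
from collections import Counter

def basic_metrics(records, low_conf=0.6):
    """Classification-quality metrics over eval records.

    Each record: {"expected": str, "predicted": str, "confidence": float}.
    """
    total = len(records)
    # Confusion pairs: A mistaken as B. Punting to needs_human is not a confusion.
    confusions = Counter(
        (r["expected"], r["predicted"])
        for r in records
        if r["predicted"] not in (r["expected"], "needs_human")
    )
    return {
        "accuracy": sum(r["predicted"] == r["expected"] for r in records) / total,
        "top_confusions": confusions.most_common(5),
        "low_confidence_rate": sum(r["confidence"] < low_conf for r in records) / total,
        "needs_human_rate": sum(r["predicted"] == "needs_human" for r in records) / total,
    }
```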
2) Coverage vs safety
Sometimes “needs_human” is good.
You want:
- high accuracy when it answers
- and a safe fallback when unsure
So track both:
- accuracy_on_answered
- % needs_human
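Both numbers fall out of the same records. A sketch, same illustrative record shape as above:

```python
def coverage_metrics(records):
    """Accuracy on answered tickets, plus how often we punted to a human."""
    answered = [r for r in records if r["predicted"] != "needs_human"]
    return {
        "accuracy_on_answered": (
            sum(r["predicted"] == r["expected"] for r in answered) / len(answered)
            if answered else 0.0
        ),
        "needs_human_rate": 1 - len(answered) / len(records),
    }
```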
3) Stability over time (drift)
Track:
- label distribution over time
- frequency changes in top keywords / error codes
- cluster size changes (from W23–W24)
If distribution changes, your prompts/rules may need updates.
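The distribution itself is one Counter away; comparing distributions across runs is step 5 below. A sketch:

```python
from collections import Counter

def label_distribution(records):
    """Share of each predicted label, e.g. {"billing": 0.4, "needs_human": 0.1, ...}."""
    counts = Counter(r["predicted"] for r in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```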
Step-by-step checklist
1) Freeze your golden set
Golden set must be:
- stable
- versioned
- not edited casually
Put it in the repo (anonymized) or keep a checksum.
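A checksum is the cheapest guard against casual edits. A minimal sketch, assuming the golden set is a single file (the path and the pinned hash are placeholders you fill in):

```python
import hashlib
import sys

GOLDEN_PATH = "golden_set.jsonl"          # placeholder path
PINNED_SHA256 = "replace-with-your-hash"  # re-pin only in a deliberate, reviewed commit

def verify_golden_set():
    digest = hashlib.sha256(open(GOLDEN_PATH, "rb").read()).hexdigest()
    if digest != PINNED_SHA256:
        sys.exit(f"Golden set changed (sha256 {digest}). Revert it, or re-pin deliberately.")
```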
2) Build an eval runner
One command:
- loads golden set
- runs your classifier
- compares predictions to expected labels
- outputs metrics + report
Keep it boring and automated.
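A skeleton for that command, assuming a JSONL golden set with `text` and `label` fields and a `classify()` function you already have from earlier weeks (all three names are assumptions):

```python
import json

def classify(text):
    """Stand-in for your real classifier.
    Expected to return {"label": str, "confidence": float}."""
    raise NotImplementedError

def run_eval(path="golden_set.jsonl"):
    records = []
    with open(path) as f:
        for line in f:
            item = json.loads(line)
            pred = classify(item["text"])
            records.append({
                "text": item["text"],  # kept so reports can show example tickets
                "expected": item["label"],
                "predicted": pred["label"],
                "confidence": pred["confidence"],
            })
    return records
```

The records deliberately match the shape consumed by `basic_metrics` and `coverage_metrics` above, so the whole pipeline stays one command.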
3) Generate a confusion summary
Show:
- top confusion pairs
- example tickets for each confusion
This tells you what to fix next (prompt/rules).
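A sketch that prints exactly that, reusing the eval records from the runner above (the `text` field carries the examples):

```python
from collections import defaultdict

def confusion_report(records, examples_per_pair=2):
    """Print confusion pairs worst-first, with a couple of example tickets each."""
    pairs = defaultdict(list)
    for r in records:
        if r["predicted"] not in (r["expected"], "needs_human"):
            pairs[(r["expected"], r["predicted"])].append(r)
    for (expected, predicted), items in sorted(pairs.items(), key=lambda kv: -len(kv[1])):
        print(f"{expected} mistaken as {predicted}: {len(items)} tickets")
        for r in items[:examples_per_pair]:
            print(f"  e.g. {r['text'][:80]}")
```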
4) Add a regression guard
Define thresholds like:
- accuracy must be >= X
- needs_human must be <= Y (or within a band)
- invalid_json must be 0
If thresholds fail:
- exit code non-zero
- report says “REGRESSION”
This is how you stop shipping worse versions.
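A sketch of the guard (the threshold numbers are placeholders; pick ones your current baseline actually clears):

```python
import sys

THRESHOLDS = {"min_accuracy": 0.85, "max_needs_human": 0.25, "max_invalid_json": 0}

def regression_guard(metrics):
    """Exit non-zero so CI (or your shell script) fails on a quality drop."""
    failures = []
    if metrics["accuracy"] < THRESHOLDS["min_accuracy"]:
        failures.append(f"accuracy {metrics['accuracy']:.1%}")
    if metrics["needs_human_rate"] > THRESHOLDS["max_needs_human"]:
        failures.append(f"needs_human {metrics['needs_human_rate']:.1%}")
    if metrics.get("invalid_json", 0) > THRESHOLDS["max_invalid_json"]:
        failures.append(f"invalid_json {metrics['invalid_json']}")
    if failures:
        print("REGRESSION:", "; ".join(failures))
        sys.exit(1)
    print("OK: all thresholds passed")
```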
5) Add drift checks
Store historical stats from previous runs:
- label distribution
- avg confidence
- needs_human rate
If the distribution shifts hard, flag it:
- “Possible drift: label X doubled”
No fancy math required for v1.
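A sketch of snapshot storage plus the “label doubled” check, using the eval-runner records from above (the directory name is illustrative):

```python
import json
import pathlib
import time

SNAP_DIR = pathlib.Path("eval_snapshots")  # illustrative location, one JSON file per run

def save_snapshot(records):
    SNAP_DIR.mkdir(exist_ok=True)
    counts = {}
    for r in records:
        counts[r["predicted"]] = counts.get(r["predicted"], 0) + 1
    snap = {
        "run_at": time.strftime("%Y-%m-%dT%H-%M-%S"),  # sorts chronologically as a string
        "label_counts": counts,
        "avg_confidence": sum(r["confidence"] for r in records) / len(records),
        "needs_human_rate": counts.get("needs_human", 0) / len(records),
    }
    (SNAP_DIR / f"{snap['run_at']}.json").write_text(json.dumps(snap, indent=2))
    return snap

def check_drift(snap, factor=2.0):
    """Compare the new snapshot to the previous one; print drift warnings."""
    older = sorted(SNAP_DIR.glob("*.json"))[:-1]  # the file we just wrote sorts last
    if not older:
        return
    prev = json.loads(older[-1].read_text())
    prev_total = sum(prev["label_counts"].values())
    new_total = sum(snap["label_counts"].values())
    for label in set(prev["label_counts"]) | set(snap["label_counts"]):
        old_share = prev["label_counts"].get(label, 0) / prev_total
        new_share = snap["label_counts"].get(label, 0) / new_total
        # Flag any label whose share grew or shrank by more than `factor`.
        ratio = max(old_share, new_share) / max(min(old_share, new_share), 1e-9)
        if ratio >= factor:
            print(f"Possible drift: label {label} share {old_share:.0%} -> {new_share:.0%}")
```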
Deliverables (you must ship these)
Deliverable A — Eval runner
- One command to run evaluation
- Outputs metrics + report
Deliverable B — Regression guard
- Thresholds defined
- Build fails (or script fails) if regression detected
Deliverable C — Drift snapshot
- A stored snapshot of label distribution and confidence
- A comparison to previous snapshot
Deliverable D — Short report
- Accuracy + needs_human
- Top confusions
- What to fix next
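Glued together, the deliverables reduce to one entry point. A sketch, assuming the helper functions from the steps above are in scope:

```python
# eval.py -- the "one command": python eval.py
if __name__ == "__main__":
    records = run_eval()               # Deliverable A: load golden set, classify, collect records
    metrics = basic_metrics(records)
    metrics.update(coverage_metrics(records))
    confusion_report(records)          # Deliverable D: top confusions with examples
    snap = save_snapshot(records)      # Deliverable C: drift snapshot + comparison
    check_drift(snap)
    regression_guard(metrics)          # Deliverable B: non-zero exit on regression
```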
Common traps (don’t do this)
- Trap 1: “I’ll evaluate later.”
Later means never. Eval is the foundation.
- Trap 2: “Only accuracy matters.”
In ops, a safe fallback matters too.
- Trap 3: “Golden set keeps changing.”
If the golden set changes every week, your metrics are meaningless.
Quick self-check (2 minutes)
Answer yes/no:
- Do I have a stable golden set?
- Can I run eval in one command?
- Do I see confusion pairs and examples?
- Do I have a regression threshold that fails builds?
- Do I store drift snapshots over time?
If any “no” — fix it before moving on.