Phase 3 · W25–W26

W25–W26: Evaluation (accuracy, drift, regressions)

Build a repeatable evaluation harness that catches regressions, tracks drift, and proves quality over time.

Suggested time: 4–6 hours/week

Outcomes

  • A repeatable eval run against your golden set.
  • Basic metrics (accuracy, confusion, needs_human rate).
  • A regression guard (fail if quality drops).
  • A simple drift check (distribution changes over time).
  • A short report you can read in 2 minutes.

Deliverables

  • One command to run evaluation with metrics and report output.
  • Regression thresholds that fail the script/build on quality drops.
  • Stored drift snapshots with comparison to previous runs.
  • Short report including accuracy, needs_human, confusions, and next fixes.

Prerequisites

  • W23–W24: Clustering & Recurring Issue Detection

W25–W26: Evaluation (accuracy, drift, regressions)

What you’re doing

You stop trusting “it feels good”.

In ops, “feels good” is how you ship nonsense.
Evaluation is how you:

  • prove improvement
  • catch regressions
  • detect drift
  • keep reliability over time

Time: 4–6 hours/week
Output: an evaluation harness that measures classification quality and alerts you when things get worse


The promise (what you’ll have by the end)

By the end of W26 you will have:

  • A repeatable eval run against your golden set
  • Basic metrics (accuracy, confusion, needs_human rate)
  • A regression guard (fail if quality drops)
  • A simple drift check (distribution changes over time)
  • A short report you can read in 2 minutes

The rule: if you can’t measure it, you can’t improve it

No numbers = vibes.
Vibes = failure.


What to measure (keep it simple)

1) Classification quality

Track:

  • primary label accuracy
  • confusion pairs (A mistaken as B)
  • low-confidence rate
  • needs_human rate

2) Coverage vs safety

Sometimes “needs_human” is good.
You want:

  • high accuracy when it answers
  • and a safe fallback when unsure

So track both:

  • accuracy_on_answered
  • % needs_human
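
Below is a minimal sketch of these metrics in Python, assuming each eval row is a dict with expected, predicted, and confidence keys (the field names and the 0.5 low-confidence threshold are illustrative, not part of the module). Confusion pairs are handled separately in step 3 of the checklist.

```python
# metrics.py (minimal sketch): field names and the low-confidence threshold are assumptions
def compute_metrics(rows, low_conf_threshold=0.5):
    """rows: [{"expected": "billing", "predicted": "billing", "confidence": 0.92}, ...]"""
    total = len(rows)
    if total == 0:
        return {}
    answered = [r for r in rows if r["predicted"] != "needs_human"]
    return {
        "accuracy": sum(r["predicted"] == r["expected"] for r in rows) / total,
        "accuracy_on_answered": (
            sum(r["predicted"] == r["expected"] for r in answered) / len(answered)
            if answered else 0.0
        ),
        "needs_human_rate": sum(r["predicted"] == "needs_human" for r in rows) / total,
        "low_confidence_rate": sum(r["confidence"] < low_conf_threshold for r in rows) / total,
    }
```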

3) Stability over time (drift)

Track:

  • label distribution over time
  • frequency changes in top keywords/error codes
  • changes in cluster sizes (from W23–W24)

If distribution changes, your prompts/rules may need updates.
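
For the first item, a minimal sketch that buckets tickets by ISO week, assuming each ticket carries a created_at ISO timestamp and a label (both field names are assumptions):

```python
# label_drift.py (minimal sketch): ticket fields created_at and label are assumptions
from collections import Counter, defaultdict
from datetime import datetime

def weekly_label_distribution(tickets):
    """Returns e.g. {"2025-W07": {"billing": 0.4, "login": 0.35, "needs_human": 0.25}}."""
    by_week = defaultdict(Counter)
    for t in tickets:
        week = datetime.fromisoformat(t["created_at"]).strftime("%G-W%V")
        by_week[week][t["label"]] += 1
    return {
        week: {label: count / sum(counts.values()) for label, count in counts.items()}
        for week, counts in by_week.items()
    }
```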


Step-by-step checklist

1) Freeze your golden set

Golden set must be:

  • stable
  • versioned
  • not edited casually

Put it in the repo (anonymized) or keep a checksum.
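
If you go the checksum route, a minimal sketch (the golden_set.jsonl and golden_set.sha256 file names are placeholders):

```python
# freeze_check.py (minimal sketch): file names are placeholders
import hashlib
import pathlib
import sys

def sha256_of(path):
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

if __name__ == "__main__":
    expected = pathlib.Path("golden_set.sha256").read_text().strip()
    actual = sha256_of("golden_set.jsonl")
    if actual != expected:
        sys.exit("Golden set changed (checksum mismatch). Version it deliberately, don't edit casually.")
    print("Golden set unchanged.")
```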

2) Build an eval runner

One command:

  • loads golden set
  • runs your classifier
  • compares predictions to expected labels
  • outputs metrics + report

Keep it boring and automated.
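
A minimal runner skeleton, assuming a classify(ticket_text) function from earlier weeks, a JSONL golden set with text and expected fields, and the compute_metrics sketch above (all of these names are assumptions):

```python
# run_eval.py (minimal sketch): classify(), file names, and field names are assumptions
import json
import pathlib

from my_classifier import classify   # hypothetical module from earlier weeks
from metrics import compute_metrics  # the sketch from "What to measure"

def load_golden(path="golden_set.jsonl"):
    lines = pathlib.Path(path).read_text().splitlines()
    return [json.loads(line) for line in lines if line.strip()]

def run_eval():
    rows = []
    for item in load_golden():
        result = classify(item["text"])  # assumed to return {"label": ..., "confidence": ...}
        rows.append({
            "text": item["text"],
            "expected": item["expected"],
            "predicted": result["label"],
            "confidence": result["confidence"],
        })
    metrics = compute_metrics(rows)
    pathlib.Path("eval_report.json").write_text(json.dumps({"metrics": metrics}, indent=2))
    print(json.dumps(metrics, indent=2))
    return metrics, rows

if __name__ == "__main__":
    run_eval()
```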

3) Generate a confusion summary

Show:

  • top confusion pairs
  • example tickets for each confusion

This tells you what to fix next (prompt/rules).
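
A minimal sketch over the rows produced by the runner above, collecting the most frequent confusion pairs plus a couple of example tickets each:

```python
# confusion.py (minimal sketch): row fields match the run_eval.py sketch above
from collections import Counter, defaultdict

def confusion_summary(rows, top_n=5, examples_per_pair=2):
    pairs = Counter()
    examples = defaultdict(list)
    for r in rows:
        if r["predicted"] not in (r["expected"], "needs_human"):
            pair = (r["expected"], r["predicted"])
            pairs[pair] += 1
            if len(examples[pair]) < examples_per_pair:
                examples[pair].append(r["text"])
    return [
        {"expected": exp, "predicted": pred, "count": count, "examples": examples[(exp, pred)]}
        for (exp, pred), count in pairs.most_common(top_n)
    ]
```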

4) Add a regression guard

Define thresholds like:

  • accuracy must be >= X
  • needs_human must be <= Y (or within a band)
  • invalid_json must be 0

If thresholds fail:

  • exit code non-zero
  • report says “REGRESSION”

This is how you stop shipping worse versions.
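
A minimal guard over the stored report, with placeholder thresholds you tune against your own baseline:

```python
# regression_guard.py (minimal sketch): thresholds and report path are placeholders
import json
import pathlib
import sys

MIN_ACCURACY = 0.85          # X
MAX_NEEDS_HUMAN_RATE = 0.25  # Y
MAX_INVALID_JSON = 0

def check_regression(metrics):
    failures = []
    if metrics["accuracy"] < MIN_ACCURACY:
        failures.append(f"accuracy {metrics['accuracy']:.2f} < {MIN_ACCURACY}")
    if metrics["needs_human_rate"] > MAX_NEEDS_HUMAN_RATE:
        failures.append(f"needs_human_rate {metrics['needs_human_rate']:.2f} > {MAX_NEEDS_HUMAN_RATE}")
    if metrics.get("invalid_json", 0) > MAX_INVALID_JSON:
        failures.append(f"invalid_json {metrics['invalid_json']} > {MAX_INVALID_JSON}")
    return failures

if __name__ == "__main__":
    metrics = json.loads(pathlib.Path("eval_report.json").read_text())["metrics"]
    failures = check_regression(metrics)
    if failures:
        print("REGRESSION: " + "; ".join(failures))
        sys.exit(1)
    print("OK: no regression.")
```

Run it as the last step of your eval command so the script (or CI job) fails loudly on a quality drop.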

5) Add drift checks

Store historical stats from previous runs:

  • label distribution
  • avg confidence
  • needs_human rate

If the distribution shifts sharply, flag it:

  • “Possible drift: label X doubled”

No fancy math required for v1.
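
A minimal sketch that appends one snapshot per run to a JSONL file and compares the new run against the previous one (the path, schema, and 2x factor are assumptions):

```python
# drift_check.py (minimal sketch): snapshot path, schema, and the 2x factor are assumptions
import datetime
import json
import pathlib

SNAPSHOTS = pathlib.Path("snapshots.jsonl")

def drift_warnings(current_distribution, factor=2.0):
    """Compare the current label distribution against the last stored snapshot."""
    if not SNAPSHOTS.exists():
        return []
    lines = SNAPSHOTS.read_text().splitlines()
    if not lines:
        return []
    previous = json.loads(lines[-1])
    warnings = []
    for label, share in current_distribution.items():
        old = previous["label_distribution"].get(label, 0.0)
        if old and share / old >= factor:
            warnings.append(f"Possible drift: label {label} went from {old:.0%} to {share:.0%}")
    return warnings

def append_snapshot(label_distribution, avg_confidence, needs_human_rate):
    snapshot = {
        "date": datetime.date.today().isoformat(),
        "label_distribution": label_distribution,
        "avg_confidence": avg_confidence,
        "needs_human_rate": needs_human_rate,
    }
    with SNAPSHOTS.open("a") as f:
        f.write(json.dumps(snapshot) + "\n")
    return snapshot
```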


Deliverables (you must ship these)

Deliverable A — Eval runner

  • One command to run evaluation
  • Outputs metrics + report

Deliverable B — Regression guard

  • Thresholds defined
  • Build fails (or script fails) if regression detected

Deliverable C — Drift snapshot

  • A stored snapshot of label distribution and confidence
  • A comparison to previous snapshot

Deliverable D — Short report

  • Accuracy + needs_human
  • Top confusions
  • What to fix next

Common traps (don’t do this)

  • Trap 1: “I’ll evaluate later.”
    Later means never. Eval is the foundation.

  • Trap 2: “Only accuracy matters.”
    In ops, a safe fallback matters too.

  • Trap 3: “Golden set keeps changing.”
    If the golden set changes every week, your metrics are meaningless.

Quick self-check (2 minutes)

Answer yes/no:

  • Do I have a stable golden set?
  • Can I run eval in one command?
  • Do I see confusion pairs and examples?
  • Do I have a regression threshold that fails builds?
  • Do I store drift snapshots over time?

If any “no” — fix it before moving on.


Next module: W27–W28: Operational Metrics & Reporting