Phase 3 · W19–W20

W19–W20: Prompting Patterns for Ops (safe prompts, constraints)

Make LLM outputs predictable, constrained, and parseable so they are actually usable in operations.

Suggested time: 4–6 hours/week

Outcomes

  • 3–5 prompt templates that solve real ops tasks.
  • A strict output schema (JSON) that your app can parse.
  • Guardrails: what the model is allowed to say and not say.
  • A fallback strategy (unknown / needs-human).
  • A mini evaluation run against your golden set.

Deliverables

  • 3 prompt templates in the repo with clear input/output format.
  • JSON schema or TS type for output with validation in code.
  • Guardrails doc with allowed labels and unknown behavior rules.
  • Mini evaluation report with accuracy, failures, and next improvements.

Prerequisites

  • W17–W18: Ticket Data Modeling & Labeling

W19–W20: Prompting Patterns for Ops (safe prompts, constraints)

What you’re doing

You’re making LLM output predictable enough to be used in operations.

The biggest mistake people make:

  • they ask the model “what do you think?”
  • get a nice paragraph back
  • and call it automation

Ops needs:

  • structure
  • constraints
  • reproducibility
  • a clear “I don’t know” path

Time: 4–6 hours/week
Output: a set of prompt templates + output schemas + safety rules + a small eval routine using your golden set


The promise (what you’ll have by the end)

By the end of W20 you will have:

  • 3–5 prompt templates that solve real ops tasks
  • A strict output schema (JSON) that your app can parse
  • Guardrails: what the model is allowed to say and not say
  • A fallback strategy (unknown / needs-human)
  • A mini evaluation run against your golden set

The rule: no free-form output in production

If the output is not parseable, it’s not usable.

Your model output must be (example below):

  • JSON
  • with fixed keys
  • with allowed values
  • with confidence and reasons
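For example, a conforming output for a routing task could look like this (keys follow the schema you will pin down in step 2 below; the values are purely illustrative):

```json
{
  "primary_label": "config",
  "secondary_label": null,
  "confidence": 0.82,
  "extracted_fields": { "system": "PRD", "country": "DE" },
  "suggested_next_steps": ["Verify the relevant configuration entry", "Confirm with the requester"],
  "needs_human": false,
  "reasons": ["Ticket describes a settings change, not a code defect"]
}
```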

Build your prompt kit (simple but real)

1) Define tasks (pick 3)

Pick tasks that are actually useful:

  • classify ticket category
  • route to team (dev/data/config/manual)
  • suggest next steps checklist
  • detect duplicates / similar tickets
  • extract key fields (BP number, system, country, etc.)

Pick 3. Don’t do 10.

2) Define output schema (strict JSON)

Example keys:

  • primary_label
  • secondary_label (optional)
  • confidence (0–1)
  • extracted_fields (object)
  • suggested_next_steps (array)
  • needs_human (boolean)
  • reasons (array of short bullet strings)

Keep it stable.
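A minimal TypeScript sketch of that schema, assuming the keys above; the label union is a placeholder, substitute the labels from your own labeling work (W17–W18):

```typescript
// Sketch of the output contract. Label values are assumptions;
// replace them with your own allowed label set.
type Label = "dev" | "data" | "config" | "manual" | "unknown";

interface TriageOutput {
  primary_label: Label;
  secondary_label?: Label;            // optional
  confidence: number;                 // 0–1
  extracted_fields: Record<string, string>;
  suggested_next_steps: string[];
  needs_human: boolean;
  reasons: string[];                  // short bullet strings
}
```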

3) Write the prompt template (with constraints)

Your prompt must include (a sketch follows the list):

  • role (“You are an AMS triage assistant…”)
  • input format
  • allowed labels list
  • output JSON schema
  • instruction to say needs_human=true if unsure
  • “do not invent facts” rule
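A minimal sketch of such a template, written here as a TypeScript constant; the label list and wording are assumptions, and the final version belongs in the repo as .md or .txt (see Deliverable A):

```typescript
// Illustrative template only. ALLOWED_LABELS and the phrasing are assumptions;
// adapt both to your own label set and ticket fields.
const ALLOWED_LABELS = ["dev", "data", "config", "manual"] as const;

const TRIAGE_PROMPT = (ticketText: string) => `
You are an AMS triage assistant.

Classify the ticket below. Use ONLY these labels: ${ALLOWED_LABELS.join(", ")}.
Do not invent facts. If information is missing or you are unsure,
set "needs_human" to true and list what is missing in "reasons".

Respond with JSON only, using exactly these keys:
primary_label, secondary_label, confidence, extracted_fields,
suggested_next_steps, needs_human, reasons.

Ticket:
"""
${ticketText}
"""
`;
```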

4) Add refusal / unknown behavior

If the ticket is missing info:

  • model must request missing fields
  • or set needs_human=true

No guessing.
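The same rule applies on the application side: if the reply does not parse or validate, treat it as needs_human instead of guessing. A minimal sketch, assuming the TriageOutput type from step 2; the 0.6 confidence threshold and the helper names are assumptions:

```typescript
// Sketch: enforce the unknown path in code, not just in the prompt.
function handleModelReply(raw: string): TriageOutput {
  const fallback: TriageOutput = {
    primary_label: "unknown",
    confidence: 0,
    extracted_fields: {},
    suggested_next_steps: [],
    needs_human: true,
    reasons: ["output not parseable or failed validation"],
  };

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return fallback;                                   // not JSON -> human
  }

  if (!isTriageOutput(parsed)) return fallback;        // wrong shape -> human
  if (parsed.confidence < 0.6) parsed.needs_human = true; // low confidence -> human
  return parsed;
}

// Minimal structural check; swap in a real schema validator if you prefer.
function isTriageOutput(x: unknown): x is TriageOutput {
  const o = x as Record<string, unknown>;
  return (
    typeof o === "object" && o !== null &&
    typeof o["primary_label"] === "string" &&
    typeof o["confidence"] === "number" &&
    typeof o["needs_human"] === "boolean" &&
    Array.isArray(o["reasons"])
  );
}
```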

5) Add a mini-eval using your golden set

Run your prompt on the golden set and compute:

  • accuracy of primary label
  • % needs_human
  • top failure patterns

This is how you stop lying to yourself.
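A minimal sketch of that loop, assuming the golden set is an array of { text, expected_label } records and that classify() wraps your model call plus the parsing above (both names are assumptions):

```typescript
// Sketch of a mini-eval over the golden set.
interface GoldenItem { text: string; expected_label: Label; }

async function miniEval(
  goldenSet: GoldenItem[],
  classify: (text: string) => Promise<TriageOutput>
) {
  let correct = 0;
  let needsHuman = 0;
  const failures = new Map<string, number>();          // "expected -> predicted" counts

  for (const item of goldenSet) {
    const out = await classify(item.text);
    if (out.needs_human) needsHuman++;
    if (out.primary_label === item.expected_label) {
      correct++;
    } else {
      const key = `${item.expected_label} -> ${out.primary_label}`;
      failures.set(key, (failures.get(key) ?? 0) + 1);
    }
  }

  const topFailures = [...failures.entries()].sort((a, b) => b[1] - a[1]).slice(0, 3);
  console.log(`accuracy: ${(correct / goldenSet.length).toFixed(2)}`);
  console.log(`% needs_human: ${((needsHuman / goldenSet.length) * 100).toFixed(0)}%`);
  console.log("top failure patterns:", topFailures);
}
```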


Deliverables (you must ship these)

Deliverable A — Prompt templates

  • 3 prompt templates stored in the repo (as .md or .txt)
  • Each has clear input/output format

Deliverable B — Output schema

  • JSON schema / TS type exists
  • Your code validates model output

Deliverable C — Guardrails doc

  • A short doc covering:
      • allowed labels
      • “do not invent facts”
      • when to set needs_human=true

Deliverable D — Mini evaluation report

  • Run results on the golden set:
      • accuracy
      • failures
      • what to improve next

Common traps (don’t do this)

  • Trap 1: “Let it answer freely.” Free-form answers are useless in ops.
  • Trap 2: “No evaluation.” Without an eval you are doing vibes, not engineering.
  • Trap 3: “No unknown path.” If the model can’t say “I don’t know”, it will hallucinate.

Quick self-check (2 minutes)

Answer yes/no:

  • Are outputs strict JSON with fixed keys?
  • Do I have allowed labels and constraints in the prompt?
  • Do I have a needs_human fallback path?
  • Did I run a mini-eval on my golden set?
  • Do I know my top 3 failure patterns?

If any “no” — fix it before moving on.


Next module: W21–W22: Classification & Routing