Phase 3 · W19–W20
W19–W20: Prompting Patterns for Ops (safe prompts, constraints)
Make LLM outputs predictable, constrained, and parseable so they are actually usable in operations.
Suggested time: 4–6 hours/week
Outcomes
- 3–5 prompt templates that solve real ops tasks.
- A strict output schema (JSON) that your app can parse.
- Guardrails: what the model is and isn’t allowed to say.
- A fallback strategy (unknown / needs-human).
- A mini evaluation run against your golden set.
Deliverables
- 3 prompt templates in the repo with clear input/output format.
- JSON schema or TS type for output with validation in code.
- Guardrails doc with allowed labels and unknown behavior rules.
- Mini evaluation report with accuracy, failures, and next improvements.
Prerequisites
- W17–W18: Ticket Data Modeling & Labeling
What you’re doing
You’re making LLM output predictable enough to be used in operations.
The biggest mistake people make:
- they ask the model “what do you think?”
- get a nice paragraph
- and call it automation
Ops needs:
- structure
- constraints
- reproducibility
- a clear “I don’t know” path
Time: 4–6 hours/week
Output: a set of prompt templates + output schemas + safety rules + a small eval routine using your golden set
The promise (what you’ll have by the end)
By the end of W20 you will have:
- 3–5 prompt templates that solve real ops tasks
- A strict output schema (JSON) that your app can parse
- Guardrails: what the model is and isn’t allowed to say
- A fallback strategy (unknown / needs-human)
- A mini evaluation run against your golden set
The rule: no free-form output in production
If the output is not parseable, it’s not usable.
Your model output must be:
- JSON
- with fixed keys
- with allowed values
- with confidence and reasons
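For example, one parseable output could look like this (values are purely illustrative; the keys are defined in the schema step below):

```json
{
  "primary_label": "config",
  "confidence": 0.82,
  "extracted_fields": { "system": "ERP", "country": "DE" },
  "suggested_next_steps": ["Check the relevant configuration entry", "Confirm with the key user"],
  "needs_human": false,
  "reasons": ["Error message points to a missing configuration entry"]
}
```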
Build your prompt kit (simple but real)
1) Define tasks (pick 3)
Pick tasks that are actually useful:
- classify ticket category
- route to team (dev/data/config/manual)
- suggest next steps checklist
- detect duplicates / similar tickets
- extract key fields (BP number, system, country, etc.)
Pick 3. Don’t do 10.
2) Define output schema (strict JSON)
Example keys:
- primary_label
- secondary_label (optional)
- confidence (0–1)
- extracted_fields (object)
- suggested_next_steps (array)
- needs_human (boolean)
- reasons (array of short bullet strings)
Keep it stable.
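As a sketch, these keys map naturally to a TypeScript type (the label union below is illustrative; use your own taxonomy from W17–W18):

```ts
// Illustrative label set — replace with your own categories.
type TriageLabel = "dev" | "data" | "config" | "manual";

interface TriageOutput {
  primary_label: TriageLabel;
  secondary_label?: TriageLabel;            // optional
  confidence: number;                       // 0–1
  extracted_fields: Record<string, string>; // e.g. BP number, system, country
  suggested_next_steps: string[];
  needs_human: boolean;
  reasons: string[];                        // short bullet strings
}
```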
3) Write the prompt template (with constraints)
Your prompt must include:
- role (“You are an AMS triage assistant…”)
- input format
- allowed labels list
- output JSON schema
- instruction to say needs_human=true if unsure
- “do not invent facts” rule
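A minimal template sketch as a TypeScript constant (the wording, label list, and the {{TICKET_TEXT}} placeholder are illustrative, not a canonical prompt):

```ts
const ALLOWED_LABELS = ["dev", "data", "config", "manual"] as const;

const TRIAGE_PROMPT = `
You are an AMS triage assistant.

Input: one support ticket, provided between <ticket> tags.

Allowed values for primary_label: ${ALLOWED_LABELS.join(", ")}.

Respond with ONLY a JSON object with exactly these keys:
primary_label, secondary_label, confidence (0-1), extracted_fields,
suggested_next_steps, needs_human, reasons.

Rules:
- Do not invent facts. Use only information present in the ticket.
- If you are unsure or required information is missing, set needs_human to true.

<ticket>
{{TICKET_TEXT}}
</ticket>
`;
```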
4) Add refusal / unknown behavior
If the ticket is missing info:
- the model must request the missing fields
- or set needs_human=true
No guessing.
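One way to enforce this in code, reusing TriageOutput and ALLOWED_LABELS from the sketches above (the fallback values are assumptions for illustration):

```ts
// Anything unparseable or incomplete gets routed to a human — never guessed.
function withFallback(raw: string): TriageOutput {
  const fallback: TriageOutput = {
    primary_label: "manual",
    confidence: 0,
    extracted_fields: {},
    suggested_next_steps: [],
    needs_human: true,
    reasons: ["Output could not be validated"],
  };

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return fallback; // not valid JSON → needs a human
  }

  const out = parsed as Partial<TriageOutput>;
  const labelOk =
    typeof out.primary_label === "string" &&
    (ALLOWED_LABELS as readonly string[]).includes(out.primary_label);
  const confOk =
    typeof out.confidence === "number" && out.confidence >= 0 && out.confidence <= 1;

  if (!labelOk || !confOk || typeof out.needs_human !== "boolean") {
    return fallback; // schema violation → no guessing
  }
  return out as TriageOutput;
}
```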
5) Add a mini-eval using your golden set
Run your prompt on the golden set and compute:
- accuracy of primary label
- % needs_human
- top failure patterns
This is how you stop lying to yourself.
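A minimal eval loop could look like this sketch (GoldenExample and the classify function are assumptions; only the three metrics above matter):

```ts
interface GoldenExample {
  ticketText: string;
  expectedLabel: TriageLabel;
}

async function miniEval(
  golden: GoldenExample[],
  classify: (ticketText: string) => Promise<TriageOutput>,
) {
  let correct = 0;
  let needsHuman = 0;
  const failures: Record<string, number> = {}; // "expected -> predicted" counts

  for (const ex of golden) {
    const out = await classify(ex.ticketText);
    if (out.needs_human) needsHuman++;
    if (out.primary_label === ex.expectedLabel) {
      correct++;
    } else {
      const key = `${ex.expectedLabel} -> ${out.primary_label}`;
      failures[key] = (failures[key] ?? 0) + 1;
    }
  }

  console.log("accuracy:", correct / golden.length);
  console.log("% needs_human:", needsHuman / golden.length);
  console.log(
    "top failure patterns:",
    Object.entries(failures).sort((a, b) => b[1] - a[1]).slice(0, 3),
  );
}
```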
Deliverables (you must ship these)
Deliverable A — Prompt templates
- 3 prompt templates stored in repo (as .md or .txt)
- Each has clear input/output format
Deliverable B — Output schema
- JSON schema / TS type exists
- Your code validates model output
Deliverable C — Guardrails doc
- A short doc:
- allowed labels
- “do not invent facts”
- when to set needs_human=true
Deliverable D — Mini evaluation report
- run results on golden set:
- accuracy
- failures
- what to improve next
Common traps (don’t do this)
- Trap 1: “Let it answer freely.” Free answers are useless in ops.
- Trap 2: “No evaluation.” Without eval you are doing vibes, not engineering.
- Trap 3: “No unknown path.” If the model can’t say “I don’t know”, it will hallucinate.
Quick self-check (2 minutes)
Answer yes/no:
- Are outputs strict JSON with fixed keys?
- Do I have allowed labels and constraints in the prompt?
- Do I have a needs_human fallback path?
- Did I run a mini-eval on my golden set?
- Do I know my top 3 failure patterns?
If any “no” — fix it before moving on.