Phase 4 · W35–W36
W35–W36: Runbooks, RCA, and Standard Operating Procedures
Create practical runbooks, RCAs, and SOPs that turn tribal knowledge into consistent, retrievable operational execution.
Suggested time: 4–6 hours/week
Outcomes
- 5–10 runbooks for common issues.
- 2–3 RCA documents for real recurring problems.
- 3–5 SOPs that standardize support behavior.
- A doc template everyone can follow.
- KB content that retrieval can actually find and cite.
Deliverables
- Runbook, RCA, and SOP templates.
- Runbooks pack with consistent metadata and structure.
- RCA pack with evidence and prevention actions.
- SOP pack that standardizes decisions and flow.
Prerequisites
- W33–W34: Guardrails & Governance (what is allowed, what is not)
What you’re doing
You stop relying on “tribal knowledge”.
RAG is useless if your knowledge base is empty or garbage.
So now you produce the content that actually saves time in AMS:
- runbooks (how to fix)
- RCA (why it happened)
- SOP (how we work consistently)
Time: 4–6 hours/week
Output: a small, high-quality knowledge base pack (runbooks, RCAs, SOPs, and their templates), all consistent and ingestible
The promise (what you’ll have by the end)
By the end of W36 you will have:
- 5–10 runbooks for common issues
- 2–3 RCA documents for real recurring problems
- 3–5 SOPs that standardize support behavior
- A doc template everyone can follow
- KB content that retrieval can actually find and cite
The rule: write for the tired engineer at 2am
A runbook should work when:
- you’re tired
- you have 10 tabs open
- production is burning
- and you need steps, not theory
What to create (keep scope realistic)
1) Runbooks (start with high-frequency pain)
Pick topics like:
- “MDG overwrite: field reset” (how to detect + mitigate)
- “Interface inbound failed: reprocess steps”
- “Partner functions missing: impact + fix”
- “Postal code/VAT validation failures”
- “Mapping missing: how to confirm + update”
- “Authorization issue: how to gather proof + request roles”
Each runbook should include:
- When to use
- Symptoms
- Preconditions
- Step-by-step actions
- Expected result
- Rollback / safety notes
- Common failures
- References
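The section list above can be checked mechanically before a runbook goes into the KB. The following is a minimal sketch, assuming each section name appears on a line of its own (optionally with `#` markers or a trailing colon); the function name `missing_sections` and the sample draft are hypothetical, not part of any existing tool.

```python
from __future__ import annotations

# Section names taken from the runbook checklist above.
REQUIRED_SECTIONS = [
    "When to use",
    "Symptoms",
    "Preconditions",
    "Step-by-step actions",
    "Expected result",
    "Rollback / safety notes",
    "Common failures",
    "References",
]


def missing_sections(doc_text: str, required: list[str] = REQUIRED_SECTIONS) -> list[str]:
    """Return the required section names that never appear as a line of their own."""
    # Normalize every line: drop '#' markers, surrounding spaces, and a trailing colon.
    headings = {
        line.strip().lstrip("#").strip().rstrip(":").lower()
        for line in doc_text.splitlines()
    }
    return [name for name in required if name.lower() not in headings]


# Hypothetical, incomplete draft used only to show the check.
draft = """\
# Interface inbound failed: reprocess steps
## When to use
## Symptoms
## Step-by-step actions
## Expected result
"""

print(missing_sections(draft))
# ['Preconditions', 'Rollback / safety notes', 'Common failures', 'References']
```

An empty list means the structure is complete; it says nothing about whether the steps themselves are correct.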
2) RCA documents (why it keeps happening)
Pick 2–3 recurring issues (your ticket clusters show which ones recur).
RCA format:
- Impact
- Timeline
- Root cause
- Contributing factors
- Fix (short term)
- Prevention (long term)
- Evidence (logs, counts, links)
- Owner + next review date
No blame. Only cause and prevention.
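The RCA format above maps naturally onto a small record type, which makes it easy to spot an RCA that is missing evidence or prevention. This is only a sketch: the class `RCA`, the helper `gaps`, and all sample values are hypothetical and for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from datetime import date


@dataclass
class RCA:
    # One field per item in the RCA format above.
    impact: str
    timeline: str
    root_cause: str
    contributing_factors: list[str]
    short_term_fix: str
    prevention: str
    evidence: list[str]  # logs, counts, links
    owner: str
    next_review: date


def gaps(rca: RCA) -> list[str]:
    """Return the names of fields that are still empty."""
    return [f.name for f in fields(rca) if not getattr(rca, f.name)]


# Illustrative draft only; none of these values describe a real incident.
draft = RCA(
    impact="Example only: inbound orders blocked for one sales org",
    timeline="Example only: detected Monday, mitigated Tuesday",
    root_cause="Example only: mapping entry missing",
    contributing_factors=["Example only: no alert on mapping gaps"],
    short_term_fix="Example only: entry added manually",
    prevention="",
    evidence=[],
    owner="team-ams",
    next_review=date(2030, 1, 1),
)

print(gaps(draft))  # ['prevention', 'evidence']
```

A draft that reports gaps in `evidence` or `prevention` is not ready to be published.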
3) SOPs (how your support machine works)
Examples:
- “Triage SOP: how to label and route tickets”
- “Data fix SOP: where fixes are allowed (AFS/MDG/S4)”
- “Release SOP: how changes are transported and tested”
- “RCA SOP: when to do RCA and how”
SOPs make your system consistent.
Step-by-step checklist
1) Create templates first
Before writing 10 docs, create 3 templates:
- Runbook template
- RCA template
- SOP template
Use your KB doc standard from W29–W30.
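One way to keep the three templates consistent is to generate their skeletons from a single definition. The sketch below rests on assumptions: the metadata fields in `METADATA_FIELDS` stand in for whatever your W29–W30 standard actually defines, the SOP section names are a guess (this page does not list them), and `render_template` is a hypothetical helper.

```python
from __future__ import annotations

# Assumption: placeholder metadata fields; replace with your W29–W30 KB doc standard.
METADATA_FIELDS = ["title", "doc_type", "tags", "owner", "last_reviewed"]

SECTIONS = {
    # Runbook and RCA sections follow the lists given in this module.
    "runbook": [
        "When to use", "Symptoms", "Preconditions", "Step-by-step actions",
        "Expected result", "Rollback / safety notes", "Common failures", "References",
    ],
    "rca": [
        "Impact", "Timeline", "Root cause", "Contributing factors",
        "Fix (short term)", "Prevention (long term)", "Evidence",
        "Owner + next review date",
    ],
    # Assumption: SOP sections are not specified in this module; adjust as needed.
    "sop": ["Purpose", "Scope", "Decision rules", "Ownership", "Escalation"],
}


def render_template(doc_type: str, title: str) -> str:
    """Build a plain-text skeleton: metadata header, then one block per section."""
    if doc_type not in SECTIONS:
        raise ValueError(f"unknown doc type: {doc_type!r}")
    known = {"title": title, "doc_type": doc_type}
    lines = [f"{name}: {known.get(name, 'TODO')}" for name in METADATA_FIELDS]
    lines.append("")
    for section in SECTIONS[doc_type]:
        lines += [section, "TODO", ""]
    return "\n".join(lines)


print(render_template("rca", "RCA: recurring mapping gaps"))
```

Because every document starts from the same generator, a change to the metadata standard only has to be made in one place.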
2) Write 5 runbooks (minimum)
Don’t aim for 50.
Aim for 5 that save pain every week.
3) Write 2 RCA docs
Use your ticket clusters as evidence.
Include numbers. Don’t write fiction.
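"Include numbers" can be as simple as counting tickets per cluster and attaching the ticket IDs. A minimal sketch, assuming each ticket carries an `id` and a `cluster` label from your earlier clustering work; the ticket list and the function `cluster_evidence` are made up for illustration and are not real data.

```python
from __future__ import annotations

from collections import Counter

# Hypothetical sample tickets, for illustration only.
tickets = [
    {"id": "T-1", "cluster": "mapping-missing"},
    {"id": "T-2", "cluster": "authorization"},
    {"id": "T-3", "cluster": "mapping-missing"},
    {"id": "T-4", "cluster": "interface-inbound"},
    {"id": "T-5", "cluster": "mapping-missing"},
]


def cluster_evidence(tickets: list[dict], cluster: str) -> dict:
    """Summarize how often a cluster occurs, with the ticket IDs as proof."""
    counts = Counter(t["cluster"] for t in tickets)
    total = len(tickets)
    n = counts[cluster]
    return {
        "cluster": cluster,
        "tickets": n,
        "share": round(n / total, 2) if total else 0.0,
        "ticket_ids": [t["id"] for t in tickets if t["cluster"] == cluster],
    }


print(cluster_evidence(tickets, "mapping-missing"))
# {'cluster': 'mapping-missing', 'tickets': 3, 'share': 0.6, 'ticket_ids': ['T-1', 'T-3', 'T-5']}
```

The resulting summary can go straight into the Evidence section of the RCA.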
4) Write 3 SOPs
Focus on “how we decide”.
Not “how SAP works”.
5) Ingest and test retrieval
After you write the docs, make sure the KB can find them:
- run ingestion
- run retrieval tests (W31–W32)
If retrieval can’t find them, fix titles/tags/metadata.
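A retrieval test for the new docs can be a short list of realistic queries, each paired with the document that should come back. In the sketch below, `search` is a toy keyword-overlap function standing in for your real W31–W32 retrieval, and the document IDs, tags, and queries are hypothetical.

```python
from __future__ import annotations

import re

# Hypothetical index entries; titles reuse example topics from this module.
DOCS = {
    "rb-interface-reprocess": {
        "title": "Interface inbound failed: reprocess steps",
        "tags": ["interface", "inbound", "reprocess"],
    },
    "rb-partner-functions": {
        "title": "Partner functions missing: impact + fix",
        "tags": ["partner", "functions"],
    },
    "sop-triage": {
        "title": "Triage SOP: how to label and route tickets",
        "tags": ["triage", "routing"],
    },
}


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(query: str, docs: dict = DOCS, k: int = 3) -> list[str]:
    """Toy stand-in for real retrieval: rank docs by shared title/tag tokens."""
    q = tokens(query)
    scored = []
    for doc_id, doc in docs.items():
        overlap = len(q & tokens(doc["title"] + " " + " ".join(doc["tags"])))
        if overlap:
            scored.append((overlap, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


# Each case: (query an engineer might type, doc ID expected in the top k).
CASES = [
    ("inbound interface message failed, how do I reprocess", "rb-interface-reprocess"),
    ("customer has no partner functions", "rb-partner-functions"),
    ("which label should this ticket get", "sop-triage"),
]


def failed_cases(cases: list[tuple[str, str]] = CASES, k: int = 3) -> list[tuple[str, str]]:
    """Return the cases whose expected doc is not among the top-k results."""
    return [(q, expected) for q, expected in cases if expected not in search(q, k=k)]


print(failed_cases())  # [] means every expected doc was found
```

Any case that fails points at a title, tag, or metadata field that does not match how people actually phrase the problem.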
Deliverables (you must ship these)
Deliverable A — Templates
- Runbook template
- RCA template
- SOP template
Deliverable B — Runbooks pack
- 5 runbooks minimum
- consistent metadata + structure
Deliverable C — RCA pack
- 2 RCAs with evidence and prevention actions
Deliverable D — SOP pack
- 3 SOPs that standardize decisions and flow
Common traps (don’t do this)
- Trap 1: “I’ll write docs later.”
Later means your KB stays empty forever.
- Trap 2: “Docs without evidence.”
RCA without evidence is fanfiction.
- Trap 3: “Too many docs, low quality.”
Better 5 good runbooks than 50 useless pages.
Quick self-check (2 minutes)
Answer yes/no:
- Do my runbooks have step-by-step actions + expected results?
- Do my RCAs include evidence and prevention?
- Do my SOPs define decisions and ownership?
- Is metadata consistent so retrieval can find docs?
- Did I test retrieval on these docs?
If any answer is “no”, fix it before moving on.