Phase 4 · W35–W36

W35–W36: Runbooks, RCA, and Standard Operating Procedures

Create practical runbooks, RCAs, and SOPs that turn tribal knowledge into consistent, retrievable operational execution.

Suggested time: 4–6 hours/week

Outcomes

  • 5–10 runbooks for common issues.
  • 2–3 RCA documents for real recurring problems.
  • 3–5 SOPs that standardize support behavior.
  • A doc template everyone can follow.
  • KB content that retrieval can actually find and cite.

Deliverables

  • Runbook, RCA, and SOP templates.
  • Runbooks pack with consistent metadata and structure.
  • RCA pack with evidence and prevention actions.
  • SOP pack that standardizes decisions and flow.

Prerequisites

  • W33–W34: Guardrails & Governance (what is allowed, what is not)

What you’re doing

You turn “tribal knowledge” into written docs that anyone on the team can find and follow.

RAG is useless if your knowledge base is empty or garbage.
So now you produce the content that actually saves time in AMS:

  • runbooks (how to fix)
  • RCA (why it happened)
  • SOP (how we work consistently)

Output: a small, high-quality knowledge base pack: runbooks + RCAs + SOPs (with their templates), all consistent and ingestible


The rule: write for the tired engineer at 2am

A runbook should work when:

  • you’re tired
  • you have 10 tabs open
  • production is burning
  • and you need steps, not theory

What to create (keep scope realistic)

1) Runbooks (start with high-frequency pain)

Pick topics like:

  • “MDG overwrite: field reset” (how to detect + mitigate)
  • “Interface inbound failed: reprocess steps”
  • “Partner functions missing: impact + fix”
  • “Postal code/VAT validation failures”
  • “Mapping missing: how to confirm + update”
  • “Authorization issue: how to gather proof + request roles”

Each runbook should include:

  • When to use
  • Symptoms
  • Preconditions
  • Step-by-step actions
  • Expected result
  • Rollback / safety notes
  • Common failures
  • References
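The section list above can be checked mechanically before a runbook goes into the KB. This is a minimal sketch, assuming each runbook is plain text with every section heading on its own line; the function name `missing_sections` and the draft text are made up for illustration.

```python
# Section names taken from the runbook list above.
REQUIRED_SECTIONS = [
    "When to use",
    "Symptoms",
    "Preconditions",
    "Step-by-step actions",
    "Expected result",
    "Rollback / safety notes",
    "Common failures",
    "References",
]


def missing_sections(runbook_text: str) -> list[str]:
    """Return required headings that never appear as a line of their own."""
    # Normalise each line: strip whitespace, drop a trailing colon, ignore case.
    lines = {line.strip().rstrip(":").lower() for line in runbook_text.splitlines()}
    return [name for name in REQUIRED_SECTIONS if name.lower() not in lines]


# Placeholder draft (illustrative text, not a real runbook).
draft = """\
When to use
Inbound interface message failed and needs reprocessing.

Symptoms
Message stuck in error status.

Step-by-step actions
1. Confirm the failure.
2. Reprocess.

Expected result
Message processed.
"""

print(missing_sections(draft))
# ['Preconditions', 'Rollback / safety notes', 'Common failures', 'References']
```

Run it over every file in the runbooks pack; an empty list means the structure is complete, not that the content is good.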

2) RCA documents (why it keeps happening)

Pick 2–3 recurring issues (your ticket clusters show which ones recur).
RCA format:

  • Impact
  • Timeline
  • Root cause
  • Contributing factors
  • Fix (short term)
  • Prevention (long term)
  • Evidence (logs, counts, links)
  • Owner + next review date

No blame. Only cause and prevention.
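One way to keep RCAs honest is to treat the format above as a record and list what is still blank. A minimal sketch, assuming one field per bullet; the class name `RCA`, the field names, and the placeholder values are illustrative choices, not a prescribed schema.

```python
from dataclasses import dataclass, field, fields


@dataclass
class RCA:
    # One field per item in the RCA format above (names are assumptions).
    impact: str = ""
    timeline: str = ""
    root_cause: str = ""
    contributing_factors: str = ""
    fix_short_term: str = ""
    prevention_long_term: str = ""
    evidence: list[str] = field(default_factory=list)  # logs, counts, links
    owner: str = ""
    next_review_date: str = ""


def empty_fields(rca: RCA) -> list[str]:
    """Names of fields that are still blank (empty string or empty list)."""
    return [f.name for f in fields(rca) if not getattr(rca, f.name)]


# Placeholder draft with only three fields filled in.
draft = RCA(
    impact="placeholder impact",
    root_cause="placeholder cause",
    owner="placeholder owner",
)

print(empty_fields(draft))
# ['timeline', 'contributing_factors', 'fix_short_term',
#  'prevention_long_term', 'evidence', 'next_review_date']
```

An RCA whose `evidence` list is empty is the first thing to send back.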

3) SOPs (how your support machine works)

Examples:

  • “Triage SOP: how to label and route tickets”
  • “Data fix SOP: where fixes are allowed (AFS/MDG/S4)”
  • “Release SOP: how changes are transported and tested”
  • “RCA SOP: when to do RCA and how”

SOPs make your system consistent.


Step-by-step checklist

1) Create templates first

Before writing 10 docs, create 3 templates:

  • Runbook template
  • RCA template
  • SOP template

Use your KB doc standard from W29–W30.

2) Write 5 runbooks (minimum)

Don’t aim for 50.
Aim for 5 that save pain every week.

3) Write 2 RCA docs

Use your ticket clusters as evidence.
Include numbers. Don’t write fiction.
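“Include numbers” can be as simple as counting tickets per cluster. A minimal sketch, assuming you can export one (ticket id, cluster label) pair per ticket; the ticket ids and the data below are invented for illustration and are not real results.

```python
from collections import Counter

# Made-up sample export: (ticket_id, cluster_label). Replace with your own data.
tickets = [
    ("T-001", "mapping missing"),
    ("T-002", "interface inbound failed"),
    ("T-003", "mapping missing"),
    ("T-004", "authorization issue"),
    ("T-005", "mapping missing"),
]


def cluster_counts(rows):
    """Count tickets per cluster, most frequent first."""
    return Counter(label for _, label in rows).most_common()


for label, count in cluster_counts(tickets):
    share = 100 * count / len(tickets)
    print(f"{label}: {count} of {len(tickets)} ({share:.0f}%)")
```

Paste the resulting counts into the Evidence section of the RCA together with the period they cover.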

4) Write 3 SOPs

Focus on “how we decide”.
Not “how SAP works”.

5) Ingest and test retrieval

After you write the docs, make sure the KB can find them:

  • run ingestion
  • run retrieval tests (W31–W32)

If retrieval can’t find them, fix titles/tags/metadata.
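The check above can be sketched as a list of (question, expected doc) pairs run against your retriever. The `search` function here is a toy word-overlap stand-in, not the retrieval built in W31–W32; the doc ids, tags, and test questions are assumptions for illustration.

```python
import re

# Hypothetical KB entries: only the fields you would fix (title, tags).
DOCS = [
    {"id": "rb-001", "title": "Interface inbound failed: reprocess steps", "tags": ["runbook", "interface"]},
    {"id": "rb-002", "title": "Mapping missing: how to confirm + update", "tags": ["runbook", "mapping"]},
    {"id": "sop-001", "title": "Triage SOP: how to label and route tickets", "tags": ["sop", "triage"]},
]


def tokens(text):
    """Lowercase word set; no stemming, so 'ticket' != 'tickets'."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(query, docs, top_k=2):
    """Toy stand-in for the real retriever: rank by word overlap with title + tags."""
    q = tokens(query)
    scored = []
    for doc in docs:
        overlap = len(q & tokens(doc["title"] + " " + " ".join(doc["tags"])))
        if overlap:
            scored.append((overlap, doc["id"]))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


# Each case: a question an engineer might type, and the doc that should come back.
CASES = [
    ("inbound interface failed how to reprocess", "rb-001"),
    ("mapping is missing", "rb-002"),
    ("ticket routing rules", "sop-001"),
]


def failed_cases(cases, docs):
    """Cases where the expected doc is not among the top results."""
    return [(q, expected) for q, expected in cases if expected not in search(q, docs)]


print(failed_cases(CASES, DOCS))
# [('ticket routing rules', 'sop-001')]
```

The failing case shows the kind of fix the text asks for: in this toy setup, adding a "routing" tag to the triage SOP makes the query match. Swap `search` for your real retrieval call and keep the cases.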


Deliverables (you must ship these)

Deliverable A — Templates

  • runbook template
  • RCA template
  • SOP template

Deliverable B — Runbooks pack

  • 5 runbooks minimum
  • consistent metadata + structure

Deliverable C — RCA pack

  • 2 RCAs with evidence and prevention actions

Deliverable D — SOP pack

  • 3 SOPs that standardize decisions and flow

Common traps (don’t do this)

  • Trap 1: “I’ll write docs later.”

Later means your KB stays empty forever.

  • Trap 2: “Docs without evidence.”

RCA without evidence is fanfiction.

  • Trap 3: “Too many docs, low quality.”

Better 5 good runbooks than 50 useless pages.

Quick self-check (2 minutes)

Answer yes/no:

  • Do my runbooks have step-by-step actions + expected results?
  • Do my RCAs include evidence and prevention?
  • Do my SOPs define decisions and ownership?
  • Is metadata consistent so retrieval can find docs?
  • Did I test retrieval on these docs?

If any answer is “no”, fix it before moving on.