Phase 2 · W15–W16
W15–W16: Observability (logs, metrics, alerts, dashboards)
Build practical observability so you can detect failures early, debug fast, and stop guessing.
Suggested time: 4–6 hours/week
Outcomes
- Structured logs you can actually search.
- A minimal metrics set that describes pipeline health.
- A simple dashboard view for quick status checks.
- Basic alerts for major failures and missing successful runs.
- Run summaries that make failed-run debugging fast.
Deliverables
- Structured logs with run_id, step, status, rows, and duration.
- Run summary storage you can query for the last 10 runs.
- One dashboard/report view that shows pipeline health in 30 seconds.
- At least 2 basic alert conditions for high-impact failures.
Prerequisites
- W13–W14: Automation & Scheduling (jobs, retries, idempotency)
W15–W16: Observability (logs, metrics, alerts, dashboards)
What you’re doing
You stop guessing.
Observability is how you go from:
- “users complain”
to
- “I saw the issue before anyone reported it”
In SAP support reality, this is a superpower:
- you catch broken runs early
- you see drift in data quality
- you detect interface issues before they explode
Time: 4–6 hours/week
Output: structured logs + a small metrics set + one dashboard view + basic alert rules
The promise (what you’ll have by the end)
By the end of W16 you will have:
- Structured logs you can actually search
- A minimal metrics set that describes pipeline health
- A simple dashboard (even if it’s a markdown report + charts)
- Basic alerts for the “big failures”
- A run summary that makes debugging fast
The rule: measure what hurts
Don’t measure 100 things.
Measure the 10 things that will save your week.
The “must-have” observability set
Logs (structured)
Every run should log:
- run_id
- dataset
- step name
- start/end
- rows in/out
- error count
- duration
- status
If your logs don’t include these, they’re basically decorative.
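For reference, here is a minimal sketch of such a log line using only the Python standard library; the `log_event` helper and its field names are illustrative, not a prescribed API:

```python
import json, sys, time

def log_event(run_id, step, status, **fields):
    """Emit one structured log line as JSON (hypothetical helper, stdlib only)."""
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "run_id": run_id,
        "step": step,
        "status": status,
        **fields,  # e.g. rows_in, rows_out, error_count, duration_s
    }
    print(json.dumps(record), file=sys.stdout)

# Example: one line per step, searchable later with grep/jq
log_event("run-2024-06-01", "extract", "ok",
          rows_in=0, rows_out=15230, error_count=0, duration_s=42.7)
```

One JSON line per step keeps logs greppable and trivially loadable into whatever analysis tool you prefer.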
Metrics (minimal but real)
Track:
- pipeline_run_success_total
- pipeline_run_failed_total
- pipeline_duration_seconds
- extracted_rows_total
- normalized_rows_total
- dq_errors_total (by rule_id)
- db_upserts_total
- last_success_timestamp
Even if you store these in a DB table first — that’s fine.
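If “store these in a DB table first” is where you start, something this small is enough; the `pipeline_metrics` table, its columns, and the file name are assumptions for the sketch:

```python
import sqlite3, time

conn = sqlite3.connect("observability.db")  # hypothetical local store
conn.execute("""
CREATE TABLE IF NOT EXISTS pipeline_metrics (
    metric      TEXT NOT NULL,   -- e.g. extracted_rows_total
    value       REAL NOT NULL,
    run_id      TEXT NOT NULL,
    recorded_at TEXT NOT NULL
)
""")

def record_metric(metric, value, run_id):
    """Append one metric fact per run; aggregation happens at read time."""
    conn.execute(
        "INSERT INTO pipeline_metrics VALUES (?, ?, ?, ?)",
        (metric, value, run_id, time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())),
    )
    conn.commit()

record_metric("extracted_rows_total", 15230, "run-2024-06-01")
record_metric("dq_errors_total", 12, "run-2024-06-01")
```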
Alerts (basic)
Alert on:
- pipeline failed
- no successful run in X hours/days
- dq errors spike above threshold
- extraction rows drop to near zero (silent failure)
- runtime doubled (something is stuck)
Start with simple thresholds. Perfection is not required.
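A sketch of those thresholds as a plain function; the parameters and the concrete values (26 hours, 50 DQ errors, 10 rows) are assumptions you should tune to your own pipeline:

```python
from datetime import datetime, timedelta, timezone

def check_alerts(last_status, last_success_at, dq_errors, rows_extracted,
                 duration_s, avg_duration_s,
                 max_silence=timedelta(hours=26), dq_threshold=50):
    """Return a list of alert messages; thresholds are illustrative, not prescriptive."""
    alerts = []
    if last_status == "failed":
        alerts.append("pipeline failed")
    if datetime.now(timezone.utc) - last_success_at > max_silence:
        alerts.append("no successful run within the allowed window")
    if dq_errors > dq_threshold:
        alerts.append(f"DQ errors spiked: {dq_errors}")
    if rows_extracted < 10:  # near zero usually means a silent failure
        alerts.append(f"extraction rows suspiciously low: {rows_extracted}")
    if avg_duration_s and duration_s > 2 * avg_duration_s:
        alerts.append("runtime doubled vs. recent average")
    return alerts
```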
Dashboard (something you can check in 30 seconds)
Create one place where you see:
- last run status
- last success time
- rows processed
- top DQ errors
- runtime trend (optional)
It can be:
- a simple web page in your app
- a generated report
- even a markdown file with numbers
The point is: fast visibility.
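A sketch of the “markdown file with numbers” option, assuming a `run_summary` table like the one sketched in step 2 of the checklist below; the table name, columns, and file paths are illustrative:

```python
import sqlite3

def write_dashboard(db_path="observability.db", out_path="dashboard.md"):
    """Render a tiny markdown status page from the (hypothetical) run_summary table."""
    conn = sqlite3.connect(db_path)
    runs = conn.execute(
        "SELECT run_id, status, finished_at, rows_loaded, dq_errors "
        "FROM run_summary ORDER BY finished_at DESC LIMIT 10"
    ).fetchall()
    last_success = conn.execute(
        "SELECT MAX(finished_at) FROM run_summary WHERE status = 'success'"
    ).fetchone()[0]

    lines = [
        "# Pipeline health",
        f"Last success: {last_success}",
        "",
        "| run_id | status | finished_at | rows | dq_errors |",
        "|---|---|---|---|---|",
    ]
    for run_id, status, finished_at, rows, dq in runs:
        lines.append(f"| {run_id} | {status} | {finished_at} | {rows} | {dq} |")
    with open(out_path, "w") as f:
        f.write("\n".join(lines) + "\n")
```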
Step-by-step checklist
1) Make logs structured
Stop printing random strings.
Use structured logs (JSON-like fields).
Minimum: always include run_id + step + status.
2) Store run summaries
Create a run summary table/file:
- run_id
- started_at
- finished_at
- status
- rows extracted/normalized/loaded
- dq errors
- duration
This becomes your “black box recorder”.
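A minimal sketch of that recorder as a SQLite table; the table name, columns, and example row are assumptions you can adapt to your own pipeline:

```python
import sqlite3

conn = sqlite3.connect("observability.db")  # hypothetical local store
conn.execute("""
CREATE TABLE IF NOT EXISTS run_summary (
    run_id          TEXT PRIMARY KEY,
    started_at      TEXT NOT NULL,
    finished_at     TEXT,
    status          TEXT NOT NULL,          -- 'success' | 'failed'
    rows_extracted  INTEGER DEFAULT 0,
    rows_normalized INTEGER DEFAULT 0,
    rows_loaded     INTEGER DEFAULT 0,
    dq_errors       INTEGER DEFAULT 0,
    duration_s      REAL
)
""")

# One row per run, written at the end of the pipeline (values are illustrative).
conn.execute(
    "INSERT OR REPLACE INTO run_summary VALUES (?,?,?,?,?,?,?,?,?)",
    ("run-2024-06-01", "2024-06-01T06:00:00Z", "2024-06-01T06:07:12Z",
     "success", 15230, 15180, 15180, 12, 431.5),
)
conn.commit()
```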
3) Collect metrics from the run summary
You can generate metrics from the run summary table.
Don’t invent a complicated monitoring stack.
Start with “store facts” and build on top.
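A sketch of deriving several of the metrics listed earlier directly from `run_summary`, so no extra monitoring stack is needed; the query assumes the hypothetical schema from step 2 and uses SQLite's 0/1 boolean expressions:

```python
import sqlite3

def metrics_from_summary(db_path="observability.db"):
    """Compute a handful of pipeline metrics from the (hypothetical) run_summary table."""
    conn = sqlite3.connect(db_path)
    cur = conn.execute("""
        SELECT
            SUM(status = 'success')  AS pipeline_run_success_total,
            SUM(status = 'failed')   AS pipeline_run_failed_total,
            SUM(rows_extracted)      AS extracted_rows_total,
            SUM(dq_errors)           AS dq_errors_total,
            MAX(CASE WHEN status = 'success' THEN finished_at END) AS last_success_timestamp
        FROM run_summary
    """)
    cols = [c[0] for c in cur.description]
    return dict(zip(cols, cur.fetchone()))
```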
4) Build one dashboard view
A single view/page/report that shows:
- last 10 runs
- top DQ errors
- last success time
- trend of rows
Make it ugly but useful.
5) Add basic alerts
If you have a GitHub Actions schedule, alerts can be:
- a failing workflow + email notification
If you have nothing, at least:
- an exit code + a log summary file
- and a manual checklist
Start simple. Add fancy later.
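For the “exit code + a log summary” route, a sketch like the following is enough: if a must-have condition fires, the script exits non-zero, so a scheduler such as GitHub Actions marks the job failed and its usual failure notification reaches you. The schema and the UTC timestamp format are the same assumptions as in the earlier sketches:

```python
import sqlite3, sys
from datetime import datetime, timedelta, timezone

def main(db_path="observability.db"):
    """Exit non-zero if a must-have alert condition fires, so the scheduler flags the job."""
    conn = sqlite3.connect(db_path)
    last = conn.execute(
        "SELECT status FROM run_summary ORDER BY started_at DESC LIMIT 1"
    ).fetchone()
    last_success = conn.execute(
        "SELECT MAX(finished_at) FROM run_summary WHERE status = 'success'"
    ).fetchone()[0]

    problems = []
    if last is None or last[0] == "failed":
        problems.append("last pipeline run failed (or no runs recorded)")
    if last_success is None:
        problems.append("no successful run recorded at all")
    else:
        # Assumes finished_at is stored as UTC in "%Y-%m-%dT%H:%M:%SZ" format.
        success_at = datetime.strptime(
            last_success, "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if datetime.now(timezone.utc) - success_at > timedelta(hours=26):
            problems.append("no successful run in the last 26 hours")

    for p in problems:
        print(f"ALERT: {p}", file=sys.stderr)
    sys.exit(1 if problems else 0)

if __name__ == "__main__":
    main()
```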
Deliverables (you must ship these)
Deliverable A — Structured logs
- Logs include run_id, step, status, rows, duration
Deliverable B — Run summary storage
- A run summary exists (DB table or file)
- You can list the last 10 runs quickly
Deliverable C — Dashboard view
- A page/report exists where pipeline health is visible in 30 seconds
Deliverable D — Basic alerts
- At least 2 alert conditions implemented:
  - pipeline failure
  - no successful run in X time
Common traps (don’t do this)
- Trap 1: “I need Prometheus/Grafana now.”
Not yet. Store run facts first. Dashboards later.
- Trap 2: “Logs are enough.”
Logs are not enough. Metrics + summaries give you trends.
- Trap 3: “I’ll alert on everything.”
No. Alert only on things that cost you real pain.
Quick self-check (2 minutes)
Answer yes/no:
- Can I see last run status in 10 seconds?
- Can I see last success time instantly?
- Do I know how many rows were processed and how many DQ errors happened?
- Do I have at least 2 alerts that catch big failures?
- Can I debug a failed run with run_id + step logs?
If any “no” — fix it before moving on.
Next module preview (W17–W18)
Next we start Phase 3: Ticket Data Modeling & Labeling.
We’ll build the foundation for AI Ticket Analyzer — but with real evaluation, not vibes.