Phase 2 · W15–W16
W15–W16: Observability (logs, metrics, alerts, dashboards)
Build practical observability so you can detect failures early, debug fast, and stop guessing.
Suggested time: 4–6 hours/week
Outcomes
- Structured logs you can actually search.
- A minimal metrics set that describes pipeline health.
- A simple dashboard view for quick status checks.
- Basic alerts for major failures and missing successful runs.
- Run summaries that make failed-run debugging fast.
Deliverables
- Structured logs with run_id, step, status, rows, and duration.
- Run summary storage you can query for the last 10 runs.
- One dashboard/report view that shows pipeline health in 30 seconds.
- At least 2 basic alert conditions for high-impact failures.
Prerequisites
- W13–W14: Automation & Scheduling (jobs, retries, idempotency)
W15–W16: Observability (logs, metrics, alerts, dashboards)
What you’re doing
You stop guessing.
Observability is how you go from:
- “users complain”
to
- “I saw the issue before anyone reported it”
In SAP support reality, this is a superpower:
- you catch broken runs early
- you see drift in data quality
- you detect interface issues before they explode
Time: 4–6 hours/week
Output: structured logs + a small metrics set + one dashboard view + basic alert rules
The promise (what you’ll have by the end)
By the end of W16 you will have:
- Structured logs you can actually search
- A minimal metrics set that describes pipeline health
- A simple dashboard (even if it’s a markdown report + charts)
- Basic alerts for the “big failures”
- A run summary that makes debugging fast
The rule: measure what hurts
Don’t measure 100 things.
Measure the 10 things that will save your week.
The “must-have” observability set
Logs (structured)
Every run should log:
- run_id
- dataset
- step name
- start/end
- rows in/out
- error count
- duration
- status
If your logs don’t include these, they’re basically decorative.
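For reference, here is a minimal sketch of such a log line using only the Python standard library; the `log_event` helper and its field names are illustrative, not a prescribed API:

```python
import json, sys, time

def log_event(run_id, step, status, **fields):
    """Emit one structured log line as JSON (hypothetical helper, stdlib only)."""
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "run_id": run_id,
        "step": step,
        "status": status,
        **fields,  # e.g. rows_in, rows_out, error_count, duration_s
    }
    print(json.dumps(record), file=sys.stdout)

# Example: one line per step, searchable later with grep/jq
log_event("run-2024-06-01", "extract", "ok",
          rows_in=0, rows_out=15230, error_count=0, duration_s=42.7)
```

One JSON line per step keeps logs greppable and trivially loadable into whatever analysis tool you prefer.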
Metrics (minimal but real)
Track:
- pipeline_run_success_total
- pipeline_run_failed_total
- pipeline_duration_seconds
- extracted_rows_total
- normalized_rows_total
- dq_errors_total (by rule_id)
- db_upserts_total
- last_success_timestamp
Even if you store these in a DB table first — that’s fine.
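If “store these in a DB table first” is where you start, something this small is enough; the `pipeline_metrics` table, its columns, and the file name are assumptions for the sketch:

```python
import sqlite3, time

conn = sqlite3.connect("observability.db")  # hypothetical local store
conn.execute("""
CREATE TABLE IF NOT EXISTS pipeline_metrics (
    metric      TEXT NOT NULL,   -- e.g. extracted_rows_total
    value       REAL NOT NULL,
    run_id      TEXT NOT NULL,
    recorded_at TEXT NOT NULL
)
""")

def record_metric(metric, value, run_id):
    """Append one metric fact per run; aggregation happens at read time."""
    conn.execute(
        "INSERT INTO pipeline_metrics VALUES (?, ?, ?, ?)",
        (metric, value, run_id, time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())),
    )
    conn.commit()

record_metric("extracted_rows_total", 15230, "run-2024-06-01")
record_metric("dq_errors_total", 12, "run-2024-06-01")
```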
Alerts (basic)
Alert on:
- pipeline failed
- no successful run in X hours/days
- dq errors spike above threshold
- extraction rows drop to near zero (silent failure)
- runtime doubled (something is stuck)
Start with simple thresholds. Perfection is not required.
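A sketch of those thresholds as a plain function; the parameters and the concrete values (26 hours, 50 DQ errors, 10 rows) are assumptions you should tune to your own pipeline:

```python
from datetime import datetime, timedelta, timezone

def check_alerts(last_status, last_success_at, dq_errors, rows_extracted,
                 duration_s, avg_duration_s,
                 max_silence=timedelta(hours=26), dq_threshold=50):
    """Return a list of alert messages; thresholds are illustrative, not prescriptive."""
    alerts = []
    if last_status == "failed":
        alerts.append("pipeline failed")
    if datetime.now(timezone.utc) - last_success_at > max_silence:
        alerts.append("no successful run within the allowed window")
    if dq_errors > dq_threshold:
        alerts.append(f"DQ errors spiked: {dq_errors}")
    if rows_extracted < 10:  # near zero usually means a silent failure
        alerts.append(f"extraction rows suspiciously low: {rows_extracted}")
    if avg_duration_s and duration_s > 2 * avg_duration_s:
        alerts.append("runtime doubled vs. recent average")
    return alerts
```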
Dashboard (something you can check in 30 seconds)
Create one place where you see:
- last run status
- last success time
- rows processed
- top DQ errors
- runtime trend (optional)
It can be:
- a simple web page in your app
- a generated report
- even a markdown file with numbers
The point is: fast visibility.
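A sketch of the “markdown file with numbers” option, assuming a `run_summary` table like the one sketched in step 2 of the checklist below; the table name, columns, and file paths are illustrative:

```python
import sqlite3

def write_dashboard(db_path="observability.db", out_path="dashboard.md"):
    """Render a tiny markdown status page from the (hypothetical) run_summary table."""
    conn = sqlite3.connect(db_path)
    runs = conn.execute(
        "SELECT run_id, status, finished_at, rows_loaded, dq_errors "
        "FROM run_summary ORDER BY finished_at DESC LIMIT 10"
    ).fetchall()
    last_success = conn.execute(
        "SELECT MAX(finished_at) FROM run_summary WHERE status = 'success'"
    ).fetchone()[0]

    lines = [
        "# Pipeline health",
        f"Last success: {last_success}",
        "",
        "| run_id | status | finished_at | rows | dq_errors |",
        "|---|---|---|---|---|",
    ]
    for run_id, status, finished_at, rows, dq in runs:
        lines.append(f"| {run_id} | {status} | {finished_at} | {rows} | {dq} |")
    with open(out_path, "w") as f:
        f.write("\n".join(lines) + "\n")
```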
Step-by-step checklist
1) Make logs structured
Stop printing random strings.
Use structured logs (JSON-like fields).
Minimum: always include run_id + step + status.
2) Store run summaries
Create a run summary table/file:
- run_id
- started_at
- finished_at
- status
- rows extracted/normalized/loaded
- dq errors
- duration
This becomes your “black box recorder”.
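A minimal sketch of that recorder as a SQLite table; the table name, columns, and example row are assumptions you can adapt to your own pipeline:

```python
import sqlite3

conn = sqlite3.connect("observability.db")  # hypothetical local store
conn.execute("""
CREATE TABLE IF NOT EXISTS run_summary (
    run_id          TEXT PRIMARY KEY,
    started_at      TEXT NOT NULL,
    finished_at     TEXT,
    status          TEXT NOT NULL,          -- 'success' | 'failed'
    rows_extracted  INTEGER DEFAULT 0,
    rows_normalized INTEGER DEFAULT 0,
    rows_loaded     INTEGER DEFAULT 0,
    dq_errors       INTEGER DEFAULT 0,
    duration_s      REAL
)
""")

# One row per run, written at the end of the pipeline (values are illustrative).
conn.execute(
    "INSERT OR REPLACE INTO run_summary VALUES (?,?,?,?,?,?,?,?,?)",
    ("run-2024-06-01", "2024-06-01T06:00:00Z", "2024-06-01T06:07:12Z",
     "success", 15230, 15180, 15180, 12, 431.5),
)
conn.commit()
```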
3) Collect metrics from the run summary
You can generate metrics from the run summary table.
Don’t invent a complicated monitoring stack.
Start with “store facts” and build on top.
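A sketch of deriving several of the metrics listed earlier directly from `run_summary`, so no extra monitoring stack is needed; the query assumes the hypothetical schema from step 2 and uses SQLite's 0/1 boolean expressions:

```python
import sqlite3

def metrics_from_summary(db_path="observability.db"):
    """Compute a handful of pipeline metrics from the (hypothetical) run_summary table."""
    conn = sqlite3.connect(db_path)
    cur = conn.execute("""
        SELECT
            SUM(status = 'success')  AS pipeline_run_success_total,
            SUM(status = 'failed')   AS pipeline_run_failed_total,
            SUM(rows_extracted)      AS extracted_rows_total,
            SUM(dq_errors)           AS dq_errors_total,
            MAX(CASE WHEN status = 'success' THEN finished_at END) AS last_success_timestamp
        FROM run_summary
    """)
    cols = [c[0] for c in cur.description]
    return dict(zip(cols, cur.fetchone()))
```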
4) Build one dashboard view
A single view/page/report that shows:
- last 10 runs
- top DQ errors
- last success time
- trend of rows
Make it ugly but useful.
5) Add basic alerts
If you have a GitHub Actions schedule, alerts can be:
- a failing workflow + email notification
If you have nothing, at least:
- an exit code + a log summary file
- and a manual checklist
Start simple. Add fancy later.
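For the “exit code + a log summary” route, a sketch like the following is enough: if a must-have condition fires, the script exits non-zero, so a scheduler such as GitHub Actions marks the job failed and its usual failure notification reaches you. The schema and the UTC timestamp format are the same assumptions as in the earlier sketches:

```python
import sqlite3, sys
from datetime import datetime, timedelta, timezone

def main(db_path="observability.db"):
    """Exit non-zero if a must-have alert condition fires, so the scheduler flags the job."""
    conn = sqlite3.connect(db_path)
    last = conn.execute(
        "SELECT status FROM run_summary ORDER BY started_at DESC LIMIT 1"
    ).fetchone()
    last_success = conn.execute(
        "SELECT MAX(finished_at) FROM run_summary WHERE status = 'success'"
    ).fetchone()[0]

    problems = []
    if last is None or last[0] == "failed":
        problems.append("last pipeline run failed (or no runs recorded)")
    if last_success is None:
        problems.append("no successful run recorded at all")
    else:
        # Assumes finished_at is stored as UTC in "%Y-%m-%dT%H:%M:%SZ" format.
        success_at = datetime.strptime(
            last_success, "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if datetime.now(timezone.utc) - success_at > timedelta(hours=26):
            problems.append("no successful run in the last 26 hours")

    for p in problems:
        print(f"ALERT: {p}", file=sys.stderr)
    sys.exit(1 if problems else 0)

if __name__ == "__main__":
    main()
```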
Deliverables (you must ship these)
Deliverable A — Structured logs
- Logs include run_id, step, status, rows, duration
Deliverable B — Run summary storage
- A run summary exists (DB table or file)
- You can list the last 10 runs quickly
Deliverable C — Dashboard view
- A page/report exists where pipeline health is visible in 30 seconds
Deliverable D — Basic alerts
- At least 2 alert conditions implemented:
  - pipeline failure
  - no successful run in X time
Common traps (don’t do this)
- Trap 1: “I need Prometheus/Grafana now.”
Not yet. Store run facts first. Dashboards later.
- Trap 2: “Logs are enough.”
Logs are not enough. Metrics + summaries give you trends.
- Trap 3: “I’ll alert on everything.”
No. Alert only on things that cost you real pain.
Quick self-check (2 minutes)
Answer yes/no:
- Can I see last run status in 10 seconds?
- Can I see last success time instantly?
- Do I know how many rows were processed and how many DQ errors happened?
- Do I have at least 2 alerts that catch big failures?
- Can I debug a failed run with run_id + step logs?
If any “no” — fix it before moving on.
Next module preview (W17–W18)
Next we start Phase 3: Ticket Data Modeling & Labeling.
We’ll build the foundation for AI Ticket Analyzer — but with real evaluation, not vibes.