Phase 2 · W15–W16

W15–W16: Observability (logs, metrics, alerts, dashboards)

Build practical observability so you can detect failures early, debug fast, and stop guessing.

Suggested time: 4–6 hours/week

Outcomes

  • Structured logs you can actually search.
  • A minimal metrics set that describes pipeline health.
  • A simple dashboard view for quick status checks.
  • Basic alerts for major failures and missing successful runs.
  • Run summaries that make failed-run debugging fast.

Deliverables

  • Structured logs with run_id, step, status, rows, and duration.
  • Run summary storage you can query for the last 10 runs.
  • One dashboard/report view that shows pipeline health in 30 seconds.
  • At least 2 basic alert conditions for high-impact failures.

Prerequisites

  • W13–W14: Automation & Scheduling (jobs, retries, idempotency)

W15–W16: Observability (logs, metrics, alerts, dashboards)

What you’re doing

You stop guessing.

Observability is how you go from:

  • “users complain”

to

  • “I saw the issue before anyone reported it”

In day-to-day SAP support, this is a superpower:

  • you catch broken runs early
  • you see drift in data quality
  • you detect interface issues before they explode

Time: 4–6 hours/week
Output: structured logs + a small metrics set + one dashboard view + basic alert rules


The promise (what you’ll have by the end)

By the end of W16 you will have:

  • Structured logs you can actually search
  • A minimal metrics set that describes pipeline health
  • A simple dashboard (even if it’s a markdown report + charts)
  • Basic alerts for the “big failures”
  • A run summary that makes debugging fast

The rule: measure what hurts

Don’t measure 100 things.
Measure the 10 things that will save your week.


The “must-have” observability set

Logs (structured)

Every run should log:

  • run_id
  • dataset
  • step name
  • start/end
  • rows in/out
  • error count
  • duration
  • status

If your logs don’t include these, they’re basically decorative.
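
As a concrete sketch, each step event can be one JSON object per line. The field names below mirror the list above; the run_id format is just an illustration, not a prescribed scheme:

```python
import json
from datetime import datetime, timezone

# One structured log record per step event. A sketch, not a fixed schema.
record = {
    "run_id": "20240601-0600-a1b2",   # hypothetical run_id format
    "dataset": "tickets",
    "step": "normalize",
    "status": "success",
    "rows_in": 1250,
    "rows_out": 1248,
    "error_count": 2,
    "duration_s": 4.7,
    "ts": datetime.now(timezone.utc).isoformat(),
}
print(json.dumps(record))  # one JSON object per line is easy to grep and parse
```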

Metrics (minimal but real)

Track:

  • pipeline_run_success_total
  • pipeline_run_failed_total
  • pipeline_duration_seconds
  • extracted_rows_total
  • normalized_rows_total
  • dq_errors_total (by rule_id)
  • db_upserts_total
  • last_success_timestamp

Even if you store these in a DB table first — that’s fine.
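
If you go the DB-table route, one narrow “metric facts” table is enough to start. A sketch with sqlite3 (the table and column names here are assumptions, not a prescribed schema):

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("pipeline.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS metrics (
        run_id      TEXT NOT NULL,
        name        TEXT NOT NULL,   -- e.g. 'extracted_rows_total'
        value       REAL NOT NULL,
        recorded_at TEXT NOT NULL    -- ISO-8601 timestamp
    )
""")

def record_metric(run_id: str, name: str, value: float) -> None:
    """Append one metric fact; aggregate at query time, not at write time."""
    conn.execute(
        "INSERT INTO metrics (run_id, name, value, recorded_at) VALUES (?, ?, ?, ?)",
        (run_id, name, value, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

record_metric("run-0042", "extracted_rows_total", 1250)
```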

Alerts (basic)

Alert on:

  • pipeline failed
  • no successful run in X hours/days
  • dq errors spike above threshold
  • extraction rows drop to near zero (silent failure)
  • runtime doubled (something is stuck)

Start with simple thresholds. Perfection is not required.
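
The thresholds can live in one small config block so they are easy to tune later. A sketch (the numbers are placeholders, not recommendations):

```python
# Alert thresholds as plain config: a sketch; tune per pipeline.
ALERT_RULES = {
    "max_hours_without_success": 26,  # daily schedule plus some slack
    "dq_errors_threshold": 50,        # a spike above this fires an alert
    "min_extracted_rows": 10,         # near-zero rows = likely silent failure
    "runtime_factor": 2.0,            # fire when runtime doubles vs. recent runs
}
```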

Dashboard (something you can check in 30 seconds)

Create one place where you see:

  • last run status
  • last success time
  • rows processed
  • top DQ errors
  • runtime trend (optional)

It can be:

  • a simple web page in your app
  • a generated report
  • even a markdown file with numbers

The point is: fast visibility.


Step-by-step checklist

1) Make logs structured

Stop printing random strings.
Use structured logs (JSON-like fields).

Minimum: always include run_id + step + status.
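
A tiny helper keeps this consistent across steps. A minimal sketch (the log_event name and fields are illustrative, not a required API):

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("pipeline")

def log_event(run_id: str, step: str, status: str, **fields) -> None:
    """Emit one JSON log line; run_id + step + status are always present."""
    logger.info(json.dumps({"run_id": run_id, "step": step, "status": status, **fields}))

log_event("run-0042", "extract", "started")
log_event("run-0042", "extract", "success", rows_out=1250, duration_s=12.3)
```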

2) Store run summaries

Create a run summary table/file:

  • run_id
  • started_at
  • finished_at
  • status
  • rows extracted/normalized/loaded
  • dq errors
  • duration

This becomes your “black box recorder”.
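
As a sketch, the table could look like this in sqlite3 (columns follow the list above; adapt names and types to whatever store you use):

```python
import sqlite3

conn = sqlite3.connect("pipeline.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS run_summary (
        run_id          TEXT PRIMARY KEY,
        started_at      TEXT NOT NULL,
        finished_at     TEXT,
        status          TEXT NOT NULL,   -- 'running' | 'success' | 'failed'
        rows_extracted  INTEGER DEFAULT 0,
        rows_normalized INTEGER DEFAULT 0,
        rows_loaded     INTEGER DEFAULT 0,
        dq_errors       INTEGER DEFAULT 0,
        duration_s      REAL
    )
""")
conn.commit()
```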

3) Collect metrics from the run summary

You can generate metrics from the run summary table.
Don’t invent a complicated monitoring stack.
Start with “store facts” and build on top.
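
Deriving the headline metrics is then a couple of queries. A sketch, assuming the run_summary table above (in SQLite a comparison evaluates to 0/1, so SUM works as a counter):

```python
import sqlite3

conn = sqlite3.connect("pipeline.db")
success_total, failed_total, last_success = conn.execute("""
    SELECT
        SUM(status = 'success'),
        SUM(status = 'failed'),
        MAX(CASE WHEN status = 'success' THEN finished_at END)
    FROM run_summary
""").fetchone()
print({"pipeline_run_success_total": success_total,
       "pipeline_run_failed_total": failed_total,
       "last_success_timestamp": last_success})
```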

4) Build one dashboard view

A single view/page/report that shows:

  • last 10 runs
  • top DQ errors
  • last success time
  • trend of rows

Make it ugly but useful.
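
If the dashboard is a generated markdown report, producing it takes a few lines. A sketch, again assuming the run_summary table above:

```python
import sqlite3

conn = sqlite3.connect("pipeline.db")
rows = conn.execute("""
    SELECT run_id, status, rows_loaded, dq_errors, duration_s
    FROM run_summary ORDER BY started_at DESC LIMIT 10
""").fetchall()

lines = ["# Pipeline status", "",
         "| run_id | status | rows_loaded | dq_errors | duration_s |",
         "|---|---|---|---|---|"]
lines += [f"| {r[0]} | {r[1]} | {r[2]} | {r[3]} | {r[4]} |" for r in rows]

with open("status.md", "w") as f:   # regenerate after every run
    f.write("\n".join(lines) + "\n")
```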

5) Add basic alerts

If you run on a GitHub Actions schedule, alerts can be as simple as:

  • a failing workflow + an email notification

If you have nothing, at least use:

  • an exit code + a log summary file
  • a manual checklist

Start simple. Add fancy later.
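
One simple pattern, as a sketch: a check script that exits non-zero when a rule fires, so whatever runs it (a GitHub Actions step, cron plus mail, anything) turns that into a notification. This assumes the run_summary table above, with finished_at stored as a timezone-aware ISO timestamp, and a placeholder threshold:

```python
import sqlite3
import sys
from datetime import datetime, timedelta, timezone

MAX_HOURS_WITHOUT_SUCCESS = 26  # placeholder; tune to your schedule

conn = sqlite3.connect("pipeline.db")
(last_success,) = conn.execute(
    "SELECT MAX(finished_at) FROM run_summary WHERE status = 'success'"
).fetchone()

alerts = []
if last_success is None:
    alerts.append("no successful run recorded at all")
elif (datetime.now(timezone.utc) - datetime.fromisoformat(last_success)
        > timedelta(hours=MAX_HOURS_WITHOUT_SUCCESS)):
    alerts.append(f"no successful run in {MAX_HOURS_WITHOUT_SUCCESS}h")

if alerts:
    print("ALERT:", "; ".join(alerts))
    sys.exit(1)  # non-zero exit -> failing job -> notification
```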


Deliverables (you must ship these)

Deliverable A — Structured logs

  • Logs include run_id, step, status, rows, duration

Deliverable B — Run summary storage

  • A run summary exists (DB table or file)
  • You can list the last 10 runs quickly

Deliverable C — Dashboard view

  • A page/report exists where pipeline health is visible in 30 seconds

Deliverable D — Basic alerts

  • At least 2 alert conditions implemented:
      • pipeline failure
      • no successful run in X time

Common traps (don’t do this)

  • Trap 1: “I need Prometheus/Grafana now.”
    Not yet. Store run facts first. Dashboards later.

  • Trap 2: “Logs are enough.”
    Logs are not enough. Metrics + summaries give you trends.

  • Trap 3: “I’ll alert on everything.”
    No. Alert only on things that cost you real pain.

Quick self-check (2 minutes)

Answer yes/no:

  • Can I see last run status in 10 seconds?
  • Can I see last success time instantly?
  • Do I know how many rows were processed and how many DQ errors occurred?
  • Do I have at least 2 alerts that catch big failures?
  • Can I debug a failed run with run_id + step logs?

If any “no” — fix it before moving on.


Next module preview (W17–W18)

Next we start Phase 3: Ticket Data Modeling & Labeling.
We’ll build the foundation for AI Ticket Analyzer — but with real evaluation, not vibes.

Next module: W17–W18W17–W18: Ticket Data Modeling & Labeling