Phase 2 · W13–W14

W13–W14: Automation & Scheduling (jobs, retries, idempotency)

Build a scheduled, resilient pipeline with retries and idempotent execution that you can run and trust every day.

Suggested time: 4–6 hours/week

Outcomes

  • A full pipeline runner executes extraction through reporting in a clear sequence.
  • Every run is traceable with run_id, timestamps, status, and dataset scope.
  • Retry logic is applied only to transient failures like network or timeout issues.
  • Idempotent behavior prevents duplicate records and inconsistent rerun outputs.
  • A documented schedule runs the pipeline automatically without manual reminders.

Deliverables

  • One-command pipeline runner with explicit step boundaries.
  • Run history with run_id and success/fail visibility.
  • Retries and idempotency protections wired into pipeline execution.
  • Schedule configuration documented in the project README.

Prerequisites

  • W11–W12: Storage Layer (Postgres schema, migrations)

What you’re doing

You stop running your pipeline “when you remember”.

This is the moment your project becomes a system:

  • it runs on schedule
  • it survives failures
  • it can be rerun safely
  • it produces consistent results

Because real SAP work is not “run script once”.
It’s “run it forever”.

Time: 4–6 hours/week
Output: a scheduled pipeline run with retries, idempotent steps, and a clear run history


The promise (what you’ll have by the end)

By the end of W14 you will have:

  • A job runner that triggers extraction → normalize → DQ → load → report
  • A run history (run_id, status, timestamps)
  • Retries for transient failures (network, timeouts)
  • Idempotency guarantees (rerun doesn’t break or duplicate)
  • A simple “manual run” mode (for debugging)
  • A basic failure notification (even if it’s just a log + exit code)

The rule: every run must be explainable

You should be able to answer:

  • what ran?
  • when?
  • with which inputs?
  • what changed?
  • what failed?
  • what was produced?

If you can’t explain it, it’s not a system.


Step-by-step checklist

1) Define the pipeline steps (explicit)

Write the pipeline as steps, in order:

  1. extract
  2. normalize
  3. dq + mapping
  4. load to Postgres
  5. generate report

Don’t hide steps inside “one huge function”.
Each step should be runnable independently.
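
A minimal sketch of what this can look like in Python (the step names and bodies are placeholders for your own code):

```python
# pipeline.py: a minimal sketch; step bodies are placeholders for your own logic
def extract(): ...         # pull raw data from the source
def normalize(): ...       # clean and reshape records
def dq_and_mapping(): ...  # data quality checks + field mapping
def load(): ...            # load into Postgres
def report(): ...          # generate the run report

# Explicit, ordered steps. Nothing hidden inside one huge function.
STEPS = [
    ("extract", extract),
    ("normalize", normalize),
    ("dq_mapping", dq_and_mapping),
    ("load", load),
    ("report", report),
]

def run_pipeline():
    for name, step in STEPS:
        print(f"step: {name}")
        step()

if __name__ == "__main__":
    run_pipeline()
```

Because each step is a plain function, you can also run any one of them on its own while debugging.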

2) Introduce a run_id

Every pipeline run gets:

  • run_id (uuid)
  • started_at / finished_at
  • status (success/fail/partial)
  • dataset name(s)

Store it in:

  • a DB table, or
  • a local run log file

Minimum: a JSON log per run.
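
A minimal sketch, assuming one JSON file per run (field names follow the list above; the folder name is a placeholder):

```python
# run_log.py: one JSON run record per pipeline run
import json
import os
import uuid
from datetime import datetime, timezone

def new_run_record(datasets):
    return {
        "run_id": str(uuid.uuid4()),
        "started_at": datetime.now(timezone.utc).isoformat(),
        "finished_at": None,
        "status": "running",  # later: success / fail / partial
        "datasets": datasets,
    }

def write_run_record(record, folder="runs"):
    # One file per run keeps history simple and greppable.
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, f"{record['run_id']}.json"), "w") as f:
        json.dump(record, f, indent=2)
```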

3) Make steps idempotent

Idempotent means:

  • rerun produces same output
  • no duplicates
  • no corrupted state

Common tactics:

  • write outputs to run-specific folders
  • use upsert in DB
  • keep record_hash for dedup
  • don’t “append forever” blindly
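
A sketch of the upsert + record_hash tactic for Postgres. The table and column names are illustrative, and it assumes a UNIQUE constraint on material_id plus a driver like psycopg2 supplying the cursor:

```python
# idempotent_load.py: rerun-safe load via upsert + content hash
import hashlib
import json

def record_hash(record: dict) -> str:
    # Stable hash: identical record content always yields the same hash.
    payload = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()

# Insert new rows, update changed rows, leave unchanged rows untouched.
UPSERT_SQL = """
INSERT INTO materials (material_id, payload, record_hash)
VALUES (%s, %s, %s)
ON CONFLICT (material_id) DO UPDATE
SET payload = EXCLUDED.payload,
    record_hash = EXCLUDED.record_hash
WHERE materials.record_hash <> EXCLUDED.record_hash
"""

def upsert(cursor, record: dict):
    cursor.execute(
        UPSERT_SQL,
        (record["material_id"], json.dumps(record), record_hash(record)),
    )
```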

4) Add retries (but don’t be stupid)

Retry only transient things:

  • HTTP calls
  • DB connection
  • file downloads

Do not retry:

  • schema errors
  • invalid data errors

Those should fail fast.
5) Add scheduling

Pick the simplest option that fits your setup:

  • cron locally
  • GitHub Actions schedule
  • a tiny scheduler inside the app (only if needed)

Start small:

  • daily run
  • manual run anytime
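
For the cron route, a single crontab entry is enough (the paths below are placeholders):

```
# crontab -e: run the pipeline daily at 06:00
0 6 * * * cd /path/to/project && python3 -m pipeline >> logs/cron.log 2>&1
```

On GitHub Actions, a schedule: trigger plus workflow_dispatch: gives you the same combination of a daily run and a manual run anytime.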

6) Add a minimal failure signal

If a run fails:

  • exit code is non-zero
  • report says “FAILED” and why
  • logs include run_id and error summary
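
A minimal sketch of the failure signal, reusing run_pipeline from the runner sketch above:

```python
# main.py: non-zero exit code + error summary on failure
import sys
from pipeline import run_pipeline

def main() -> int:
    try:
        run_pipeline()
    except Exception as exc:
        # in a real run, include the run_id from your run record here
        print(f"FAILED: {exc}", file=sys.stderr)
        return 1  # cron / CI treats non-zero exit as failure
    return 0

if __name__ == "__main__":
    sys.exit(main())
```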

Bonus (optional):

  • send a notification (email/Slack) if your environment supports it

But do not block on this.

Deliverables (you must ship these)

Deliverable A — Pipeline runner

  • One command to run full pipeline end-to-end
  • Steps are clearly separated

Deliverable B — Run history

  • Each run has run_id and status
  • You can find last successful run quickly

Deliverable C — Retries + idempotency

  • Retries exist for transient failures
  • Rerun does not duplicate DB records or raw artifacts

Deliverable D — Scheduled execution

  • A schedule exists (cron or GitHub Actions or equivalent)
  • Documented in README

Common traps (don’t do this)

  • Trap 1: “I’ll build a full orchestrator.”
    No. Simple runner + schedule first. Complexity later.

  • Trap 2: “Retries everywhere.”
    No. Retry only transient failures.

  • Trap 3: “Idempotency is overkill.”
    No. Without idempotency your pipeline will rot and you’ll stop trusting it.

Quick self-check (2 minutes)

Answer yes/no:

  • Can I run the whole pipeline with one command?
  • Do I have a run_id and run report for every run?
  • Can I rerun safely without duplicates?
  • Do retries exist only where appropriate?
  • Is there a schedule that runs without me remembering?

If any “no” — fix it before moving on.


Next module: W15–W16: Observability (logs, metrics, alerts, dashboards)