Phase 2 · W13–W14
W13–W14: Automation & Scheduling (jobs, retries, idempotency)
Build a scheduled, resilient pipeline with retries and idempotent execution that you can run and trust every day.
Suggested time: 4–6 hours/week
Outcomes
- A full pipeline runner executes extraction through reporting in a clear sequence.
- Every run is traceable with run_id, timestamps, status, and dataset scope.
- Retry logic is applied only to transient failures like network or timeout issues.
- Idempotent behavior prevents duplicate records and inconsistent rerun outputs.
- A documented schedule runs the pipeline automatically without manual reminders.
Deliverables
- One-command pipeline runner with explicit step boundaries.
- Run history with run_id and success/fail visibility.
- Retries and idempotency protections wired into pipeline execution.
- Schedule configuration documented in the project README.
Prerequisites
- W11–W12: Storage Layer (Postgres schema, migrations)
What you’re doing
You stop running your pipeline “when you remember”.
This is the moment your project becomes a system:
- it runs on schedule
- it survives failures
- it can be rerun safely
- it produces consistent results
Because real SAP work is not “run script once”.
It’s “run it forever”.
Time: 4–6 hours/week
Output: a scheduled pipeline run with retries, idempotent steps, and a clear run history
The promise (what you’ll have by the end)
By the end of W14 you will have:
- A job runner that triggers extraction → normalize → DQ → load → report
- A run history (run_id, status, timestamps)
- Retries for transient failures (network, timeouts)
- Idempotency guarantees (rerun doesn’t break or duplicate)
- A simple “manual run” mode (for debugging)
- A basic failure notification (even if it’s just a log + exit code)
The rule: every run must be explainable
You should be able to answer:
- what ran?
- when?
- with which inputs?
- what changed?
- what failed?
- what was produced?
If you can’t explain it, it’s not a system.
Step-by-step checklist
1) Define the pipeline steps (explicit)
Write the pipeline as steps, in order:
- extract
- normalize
- DQ + mapping
- load to Postgres
- generate report
Don’t hide steps inside “one huge function”.
Each step should be runnable independently.
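A minimal runner sketch along these lines (the step functions are placeholders for your own modules, and the CLI shape is just one option, not a required design):

```python
import argparse

def extract():    ...  # placeholder: pull raw data from the source system
def normalize():  ...  # placeholder: clean and standardize records
def dq_and_map(): ...  # placeholder: data-quality checks + mapping
def load():       ...  # placeholder: load into Postgres
def report():     ...  # placeholder: generate the run report

# Explicit, ordered step registry; each step is runnable on its own.
STEPS = {
    "extract": extract,
    "normalize": normalize,
    "dq": dq_and_map,
    "load": load,
    "report": report,
}

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("step", nargs="?", choices=list(STEPS),
                        help="run a single step; omit to run the full pipeline")
    args = parser.parse_args()
    for name in ([args.step] if args.step else STEPS):
        STEPS[name]()
```

This gives you both modes for free: one command for the full pipeline, and `python run_pipeline.py extract` (or any other step name) for debugging a single step.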
2) Introduce a run_id
Every pipeline run gets:
- run_id (uuid)
- started_at / finished_at
- status (success/fail/partial)
- dataset name(s)
Store it in:
- a DB table, or
- a local run log file
Minimum: a JSON log per run.
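One possible shape for that JSON log (the field names and the `runs/` directory are suggestions, not a fixed schema):

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

RUNS_DIR = Path("runs")  # hypothetical location for per-run logs

def start_run(datasets: list[str]) -> dict:
    # One record per run: who/when/what, filled in as the run progresses.
    return {
        "run_id": str(uuid.uuid4()),
        "started_at": datetime.now(timezone.utc).isoformat(),
        "finished_at": None,
        "status": "running",
        "datasets": datasets,
    }

def finish_run(run: dict, status: str) -> None:
    # status: "success" / "fail" / "partial"
    run["finished_at"] = datetime.now(timezone.utc).isoformat()
    run["status"] = status
    RUNS_DIR.mkdir(exist_ok=True)
    (RUNS_DIR / f"{run['run_id']}.json").write_text(json.dumps(run, indent=2))
```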
3) Make steps idempotent
Idempotent means:
- rerun produces same output
- no duplicates
- no corrupted state
Common tactics:
- write outputs to run-specific folders
- use upsert in DB
- keep record_hash for dedup
- don’t “append forever” blindly
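A sketch of the upsert + record_hash combination, assuming a hypothetical Postgres table like `records(record_key text PRIMARY KEY, record_hash text, payload jsonb)`; adapt the names to your own schema:

```python
import hashlib
import json

def record_hash(record: dict) -> str:
    # Stable hash of the record content: the same input always produces
    # the same hash, so a rerun can be detected instead of duplicated.
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

# Upsert: insert new records, update changed ones, leave unchanged ones alone.
UPSERT_SQL = """
INSERT INTO records (record_key, record_hash, payload)
VALUES (%s, %s, %s)
ON CONFLICT (record_key) DO UPDATE
SET record_hash = EXCLUDED.record_hash,
    payload     = EXCLUDED.payload
WHERE records.record_hash IS DISTINCT FROM EXCLUDED.record_hash;
"""
```

With this in place, rerunning the load step on the same input is a no-op instead of a pile of duplicates.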
4) Add retries (but don’t be stupid)
Retry only transient things:
- HTTP calls
- DB connection
- file downloads
Do not retry:
- schema errors
- invalid data errors
Those should fail fast.
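A small retry helper in that spirit (the `TRANSIENT` tuple is an assumption; substitute the transient exception types of your actual HTTP or DB client):

```python
import time

# Assumption: stand-in exception types. Real clients define their own
# transient errors (connection resets, timeouts); list those here instead.
TRANSIENT = (ConnectionError, TimeoutError)

def with_retries(fn, attempts=3, base_delay=2.0):
    """Retry fn only on transient errors, with exponential backoff.

    Anything not listed in TRANSIENT (schema errors, invalid data)
    propagates immediately: fail fast.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except TRANSIENT:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
```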
5) Add scheduling
Pick the simplest option that fits your setup:
- cron locally
- GitHub Actions schedule
- a tiny scheduler inside the app (only if needed)
Start small:
- daily run
- manual run anytime
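If you go the GitHub Actions route, a minimal schedule might look like this (file name and entry point are placeholders); `workflow_dispatch` gives you the manual run:

```yaml
# .github/workflows/pipeline.yml — a minimal scheduling sketch
name: pipeline
on:
  schedule:
    - cron: "0 6 * * *"   # daily at 06:00 UTC
  workflow_dispatch:       # manual run anytime, from the Actions tab
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python run_pipeline.py   # hypothetical entry point
```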
6) Add a minimal failure signal
If a run fails:
- exit code is non-zero
- report says “FAILED” and why
- logs include run_id and error summary
Bonus (optional):
- send a notification (email/Slack) if your environment supports it
But do not block on this.
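The minimal failure signal can be as small as this sketch (`run_all_steps` is a placeholder for your step sequence; in practice the run_id would come from your run history, not a literal):

```python
import logging
import sys

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_all_steps() -> None:
    ...  # placeholder: extract → normalize → DQ → load → report

def main() -> int:
    run_id = "..."  # in practice, from start_run() in the run-history sketch
    try:
        run_all_steps()
        log.info("run %s succeeded", run_id)
        return 0
    except Exception as exc:
        log.error("run %s FAILED: %s", run_id, exc)
        return 1  # non-zero exit code: cron/CI sees the failure

if __name__ == "__main__":
    sys.exit(main())
```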
Deliverables (you must ship these)
Deliverable A — Pipeline runner
- One command to run full pipeline end-to-end
- Steps are clearly separated
Deliverable B — Run history
- Each run has run_id and status
- You can find the last successful run quickly
Deliverable C — Retries + idempotency
- Retries exist for transient failures
- Rerun does not duplicate DB records or raw artifacts
Deliverable D — Scheduled execution
- A schedule exists (cron or GitHub Actions or equivalent)
- Documented in README
Common traps (don’t do this)
- Trap 1: “I’ll build a full orchestrator.”
No. Simple runner + schedule first. Complexity later.
- Trap 2: “Retries everywhere.”
No. Retry only transient failures.
- Trap 3: “Idempotency is overkill.”
No. Without idempotency your pipeline will rot and you’ll stop trusting it.
Quick self-check (2 minutes)
Answer yes/no:
- Can I run the whole pipeline with one command?
- Do I have a run_id and run report for every run?
- Can I rerun safely without duplicates?
- Do retries exist only where appropriate?
- Is there a schedule that runs without me remembering?
If any “no” — fix it before moving on.