Phase 2 · W13–W14
W13–W14: Automation & Scheduling (jobs, retries, idempotency)
Build a scheduled, resilient pipeline with retries and idempotent execution that you can run and trust every day.
Suggested time: 4–6 hours/week
Outcomes
- A full pipeline runner executes extraction through reporting in a clear sequence.
- Every run is traceable with run_id, timestamps, status, and dataset scope.
- Retry logic is applied only to transient failures like network or timeout issues.
- Idempotent behavior prevents duplicate records and inconsistent rerun outputs.
- A documented schedule runs the pipeline automatically without manual reminders.
Deliverables
- One-command pipeline runner with explicit step boundaries.
- Run history with run_id and success/fail visibility.
- Retries and idempotency protections wired into pipeline execution.
- Schedule configuration documented in the project README.
Prerequisites
- W11–W12: Storage Layer (Postgres schema, migrations)
What you’re doing
You stop running your pipeline “when you remember”.
This is the moment your project becomes a system:
- it runs on schedule
- it survives failures
- it can be rerun safely
- it produces consistent results
Because real SAP work is not “run script once”.
It’s “run it forever”.
Time: 4–6 hours/week
Output: a scheduled pipeline run with retries, idempotent steps, and a clear run history
The promise (what you’ll have by the end)
By the end of W14 you will have:
- A job runner that triggers extraction → normalize → DQ → load → report
- A run history (run_id, status, timestamps)
- Retries for transient failures (network, timeouts)
- Idempotency guarantees (rerun doesn’t break or duplicate)
- A simple “manual run” mode (for debugging)
- A basic failure notification (even if it’s just a log + exit code)
The rule: every run must be explainable
You should be able to answer:
- what ran?
- when?
- with which inputs?
- what changed?
- what failed?
- what was produced?
If you can’t explain it, it’s not a system.
Step-by-step checklist
1) Define the pipeline steps (explicit)
Write the pipeline as steps, in order:
- extract
- normalize
- DQ + mapping
- load to Postgres
- generate report
Don’t hide steps inside “one huge function”.
Each step should be runnable independently.
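A minimal runner sketch along these lines (the step functions are placeholders for your own modules, and the CLI shape is just one option, not a required design):

```python
import argparse

def extract():    ...  # placeholder: pull raw data from the source system
def normalize():  ...  # placeholder: clean and standardize records
def dq_and_map(): ...  # placeholder: data-quality checks + mapping
def load():       ...  # placeholder: load into Postgres
def report():     ...  # placeholder: generate the run report

# Explicit, ordered step registry; each step is runnable on its own.
STEPS = {
    "extract": extract,
    "normalize": normalize,
    "dq": dq_and_map,
    "load": load,
    "report": report,
}

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("step", nargs="?", choices=list(STEPS),
                        help="run a single step; omit to run the full pipeline")
    args = parser.parse_args()
    for name in ([args.step] if args.step else STEPS):
        STEPS[name]()
```

This gives you both modes for free: one command for the full pipeline, and `python run_pipeline.py extract` (or any other step name) for debugging a single step.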
2) Introduce a run_id
Every pipeline run gets:
- run_id (uuid)
- started_at / finished_at
- status (success/fail/partial)
- dataset name(s)
Store it in:
- a DB table, or
- a local run log file
Minimum: a JSON log per run.
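One possible shape for that JSON log (the field names and the `runs/` directory are suggestions, not a fixed schema):

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

RUNS_DIR = Path("runs")  # hypothetical location for per-run logs

def start_run(datasets: list[str]) -> dict:
    # One record per run: who/when/what, filled in as the run progresses.
    return {
        "run_id": str(uuid.uuid4()),
        "started_at": datetime.now(timezone.utc).isoformat(),
        "finished_at": None,
        "status": "running",
        "datasets": datasets,
    }

def finish_run(run: dict, status: str) -> None:
    # status: "success" / "fail" / "partial"
    run["finished_at"] = datetime.now(timezone.utc).isoformat()
    run["status"] = status
    RUNS_DIR.mkdir(exist_ok=True)
    (RUNS_DIR / f"{run['run_id']}.json").write_text(json.dumps(run, indent=2))
```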
3) Make steps idempotent
Idempotent means:
- rerun produces same output
- no duplicates
- no corrupted state
Common tactics:
- write outputs to run-specific folders
- use upsert in DB
- keep record_hash for dedup
- don’t “append forever” blindly
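A sketch of the upsert + record_hash combination, assuming a hypothetical Postgres table like `records(record_key text PRIMARY KEY, record_hash text, payload jsonb)`; adapt the names to your own schema:

```python
import hashlib
import json

def record_hash(record: dict) -> str:
    # Stable hash of the record content: the same input always produces
    # the same hash, so a rerun can be detected instead of duplicated.
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

# Upsert: insert new records, update changed ones, leave unchanged ones alone.
UPSERT_SQL = """
INSERT INTO records (record_key, record_hash, payload)
VALUES (%s, %s, %s)
ON CONFLICT (record_key) DO UPDATE
SET record_hash = EXCLUDED.record_hash,
    payload     = EXCLUDED.payload
WHERE records.record_hash IS DISTINCT FROM EXCLUDED.record_hash;
"""
```

With this in place, rerunning the load step on the same input is a no-op instead of a pile of duplicates.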
4) Add retries (but don’t be stupid)
Retry only transient things:
- HTTP calls
- DB connection
- file downloads
Do not retry:
- schema errors
- invalid data errors
Those should fail fast.
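A small retry helper in that spirit (the `TRANSIENT` tuple is an assumption; substitute the transient exception types of your actual HTTP or DB client):

```python
import time

# Assumption: stand-in exception types. Real clients define their own
# transient errors (connection resets, timeouts); list those here instead.
TRANSIENT = (ConnectionError, TimeoutError)

def with_retries(fn, attempts=3, base_delay=2.0):
    """Retry fn only on transient errors, with exponential backoff.

    Anything not listed in TRANSIENT (schema errors, invalid data)
    propagates immediately: fail fast.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except TRANSIENT:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
```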
5) Add scheduling
Pick the simplest option that fits your setup:
- cron locally
- GitHub Actions schedule
- a tiny scheduler inside the app (only if needed)
Start small:
- daily run
- manual run anytime
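If you go the GitHub Actions route, a minimal schedule might look like this (file name and entry point are placeholders); `workflow_dispatch` gives you the manual run:

```yaml
# .github/workflows/pipeline.yml — a minimal scheduling sketch
name: pipeline
on:
  schedule:
    - cron: "0 6 * * *"   # daily at 06:00 UTC
  workflow_dispatch:       # manual run anytime, from the Actions tab
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python run_pipeline.py   # hypothetical entry point
```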
6) Add a minimal failure signal
If a run fails:
- exit code is non-zero
- report says “FAILED” and why
- logs include run_id and error summary
Bonus (optional):
- send a notification (email/Slack) if your environment supports it
But do not block on this.
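The minimal failure signal can be as small as this sketch (`run_all_steps` is a placeholder for your step sequence; in practice the run_id would come from your run history, not a literal):

```python
import logging
import sys

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_all_steps() -> None:
    ...  # placeholder: extract → normalize → DQ → load → report

def main() -> int:
    run_id = "..."  # in practice, from start_run() in the run-history sketch
    try:
        run_all_steps()
        log.info("run %s succeeded", run_id)
        return 0
    except Exception as exc:
        log.error("run %s FAILED: %s", run_id, exc)
        return 1  # non-zero exit code: cron/CI sees the failure

if __name__ == "__main__":
    sys.exit(main())
```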
Deliverables (you must ship these)
Deliverable A — Pipeline runner
- One command to run full pipeline end-to-end
- Steps are clearly separated
Deliverable B — Run history
- Each run has run_id and status
- You can find the last successful run quickly
Deliverable C — Retries + idempotency
- Retries exist for transient failures
- Rerun does not duplicate DB records or raw artifacts
Deliverable D — Scheduled execution
- A schedule exists (cron or GitHub Actions or equivalent)
- Documented in README
Common traps (don’t do this)
- Trap 1: “I’ll build a full orchestrator.”
No. Simple runner + schedule first. Complexity later.
- Trap 2: “Retries everywhere.”
No. Retry only transient failures.
- Trap 3: “Idempotency is overkill.”
No. Without idempotency your pipeline will rot and you’ll stop trusting it.
Quick self-check (2 minutes)
Answer yes/no:
- Can I run the whole pipeline with one command?
- Do I have a run_id and run report for every run?
- Can I rerun safely without duplicates?
- Do retries exist only where appropriate?
- Is there a schedule that runs without me remembering?
If any “no” — fix it before moving on.