Phase 2 · W7–W8
W7–W8: SAP Extraction Patterns (OData / files / exports)
Build one extraction flow that is predictable, observable, and resumable instead of fragile.
Suggested time: 4–6 hours/week
Outcomes
- One working extractor (pick one method: OData OR files).
- A raw data storage pattern (so you never lose the original truth).
- A normalization step (raw → clean tables).
- A delta strategy (even if simple at first).
- Logging and basic metrics (so you can debug fast).
- A “failure plan” (retries, resume, partial loads).
Deliverables
- A script or service that pulls data (OData OR file ingestion) and writes raw artifacts to disk with run_id.
- A step that converts raw → clean output (JSONL or staging table).
- A generated markdown file or console summary with rows fetched, rows normalized, top errors, and runtime.
- README updated with extraction run steps, raw storage path, and normalization behavior.
Prerequisites
- W6: Packaging, Testing Baseline, and “Definition of Done”
What you’re doing
You’re learning how to pull SAP data without turning your life into a support nightmare.
Extraction is where most “data projects” die:
- wrong deltas
- timeouts
- missing authorizations
- broken formats
- silent truncation
- “it worked in QA”
So we build extraction like an engineer: predictable, observable, and resumable.
Time: 4–6 hours/week
Output: one extraction pipeline that can fetch data (OData or file export), store raw snapshots, and produce a clean normalized output
The promise (what you’ll have by the end)
By the end of W8 you will have:
- One working extractor (pick one method: OData OR files)
- A raw data storage pattern (so you never lose the original truth)
- A normalization step (raw → clean tables)
- A delta strategy (even if simple at first)
- Logging and basic metrics (so you can debug fast)
- A “failure plan” (retries, resume, partial loads)
Pick ONE extraction method (don’t be greedy)
Choose based on your reality:
Option A — OData
Best when:
- you have stable services
- you can filter by date/changed-on
- you want structured JSON
Option B — Files / Exports
Best when:
- the org exports CSV/IDoc dumps already
- OData is slow or blocked
- you can get scheduled drops (SFTP / Share / email attachments — yes, real life)
Pick ONE now. You can add the second later.
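If you pick Option A, a delta pull is usually just a filtered GET against the service. A minimal sketch, assuming a hypothetical gateway URL, entity set, and ChangedOn field (all of those names, plus the credentials, will differ in your system):

```python
# Minimal OData v2 delta pull (sketch). BASE_URL, ENTITY and "ChangedOn" are
# hypothetical placeholders -- adjust them to your actual service.
from datetime import date, timedelta

import requests

BASE_URL = "https://sap-gateway.example.com/sap/opu/odata/sap/ZBP_SRV"  # hypothetical
ENTITY = "BusinessPartners"                                             # hypothetical

since = date.today() - timedelta(days=3)  # overlap window: re-fetch the last 3 days
params = {
    "$filter": f"ChangedOn ge datetime'{since.isoformat()}T00:00:00'",
    "$format": "json",
    "$top": "5000",        # page size; combine with $skip to paginate
}

resp = requests.get(
    f"{BASE_URL}/{ENTITY}",
    params=params,
    auth=("user", "pass"),  # replace with your real auth mechanism
    timeout=60,
)
resp.raise_for_status()
rows = resp.json()["d"]["results"]   # OData v2 JSON envelope
print(f"fetched {len(rows)} rows changed since {since}")
```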
Your extraction “golden rules”
- Always store raw data first.
- Never trust deltas blindly.
- Every run has a run_id and a timestamp.
- Failures must be restartable.
- Logs must tell you what happened.
Step-by-step checklist
1) Define the dataset
Write down:
- object (BP / addresses / partner functions / etc.)
- fields
- expected volume
- delta key (changed_on, or something else)
- file format (if file-based)
Keep it on one page.
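It helps to keep that one-pager in a machine-readable form next to the extractor. A rough example with made-up values:

```python
# One-page dataset definition kept next to the extractor (example values only).
DATASET = {
    "name": "business_partners",
    "object": "BP + addresses",
    "fields": ["PartnerId", "Name", "City", "Country", "ChangedOn"],
    "expected_volume": "~200k rows full, ~2k/day delta",
    "delta_key": "ChangedOn",
    "file_format": None,  # only relevant for file-based extraction, e.g. "csv"
}
```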
2) Build raw storage
Create a folder like:
- data/raw/<dataset>/<date>/<run_id>.json (or .csv)
Raw means:
- unmodified
- exactly what you got from SAP/export
This is your insurance.
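A minimal sketch of that write path, assuming JSON payloads and the folder layout above; the run_id and fetch timestamp are stored inside the artifact as well:

```python
# Write the untouched payload to data/raw/<dataset>/<date>/<run_id>.json (sketch).
import json
import uuid
from datetime import date, datetime, timezone
from pathlib import Path

def write_raw(dataset: str, payload: list[dict]) -> Path:
    run_id = uuid.uuid4().hex[:12]                      # one run_id per extraction run
    out_dir = Path("data/raw") / dataset / date.today().isoformat()
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{run_id}.json"
    path.write_text(json.dumps({
        "run_id": run_id,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "rows": payload,                                 # exactly what SAP/export returned
    }, ensure_ascii=False))
    return path
```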
3) Build normalization (raw → clean)
Normalize into:
- a clean JSONL file, or
- Postgres staging table
Goal:
- consistent columns
- types fixed
- missing values handled
- duplicates flagged (not hidden)
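A minimal normalization sketch, assuming the JSONL route and example field names like PartnerId and ChangedOn (swap in your own):

```python
# Raw -> clean JSONL sketch: fix types, handle missing values, flag duplicates.
import json
from pathlib import Path

def normalize(raw_path: Path, clean_path: Path, key_field: str = "PartnerId") -> dict:
    raw = json.loads(raw_path.read_text())
    seen, stats = set(), {"rows_in": 0, "rows_out": 0, "duplicates": 0}
    with clean_path.open("w") as out:
        for row in raw["rows"]:
            stats["rows_in"] += 1
            record = {
                "partner_id": str(row.get(key_field, "")).strip(),
                "name": (row.get("Name") or "").strip() or None,   # empty -> explicit null
                "changed_on": row.get("ChangedOn"),
                "is_duplicate": False,
            }
            if record["partner_id"] in seen:
                record["is_duplicate"] = True                      # flag, don't hide
                stats["duplicates"] += 1
            seen.add(record["partner_id"])
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            stats["rows_out"] += 1
    return stats
```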
4) Add delta strategy (simple but honest)
Start simple. Either:
- full load weekly + daily small incremental, or
- incremental by changed_on with a safety overlap window (e.g., last 3 days)
Important:
- overlap is normal
- dedup is mandatory
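A sketch of the overlap window plus dedup, assuming changed_on holds ISO-8601 timestamps so the newest version of each key wins:

```python
# Overlap window + dedup after an incremental load (sketch).
from datetime import date, timedelta

OVERLAP_DAYS = 3

def delta_since(last_run_date: date) -> date:
    """Re-fetch a few extra days so late or missed changes are picked up again."""
    return last_run_date - timedelta(days=OVERLAP_DAYS)

def dedup_latest(rows: list[dict], key: str = "partner_id", ts: str = "changed_on") -> list[dict]:
    latest: dict[str, dict] = {}
    for row in rows:
        prev = latest.get(row[key])
        # Later changed_on wins (string compare is chronological for ISO-8601).
        if prev is None or (row[ts] or "") > (prev[ts] or ""):
            latest[row[key]] = row
    return list(latest.values())
```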
5) Add observability
Minimum logs:
- dataset name
- run_id
- start/end time
- rows fetched
- rows normalized
- errors count
If you can’t answer “what happened in the last run?” you will suffer.
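One structured log line per run is enough to start. A minimal sketch:

```python
# Minimal structured run log (sketch): one line answers "what happened in the last run?".
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("extract")

def log_run(dataset: str, run_id: str, started: float, stats: dict, errors: int) -> None:
    log.info(json.dumps({
        "dataset": dataset,
        "run_id": run_id,
        "runtime_s": round(time.time() - started, 1),
        "rows_fetched": stats.get("rows_in", 0),
        "rows_normalized": stats.get("rows_out", 0),
        "errors": errors,
    }))
```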
6) Add failure handling
Minimum:
- retries for network
- resume from last successful page/file
- keep partial raw artifacts (don’t delete)
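A sketch of retries with backoff and page-level resume, assuming a paged OData endpoint driven by $top/$skip (for file-based ingestion, resume per file instead):

```python
# Retries with exponential backoff and page-level resume (sketch).
import time

import requests

def fetch_all(url: str, page_size: int = 5000, start_page: int = 0, max_retries: int = 3):
    """Resume a failed run by passing the last successful page as start_page."""
    page = start_page
    while True:
        params = {"$format": "json", "$top": page_size, "$skip": page * page_size}
        for attempt in range(max_retries):
            try:
                resp = requests.get(url, params=params, timeout=60)
                resp.raise_for_status()
                break
            except requests.RequestException:
                if attempt == max_retries - 1:
                    raise                       # give up; raw pages written so far stay on disk
                time.sleep(2 ** attempt)        # simple exponential backoff
        rows = resp.json()["d"]["results"]
        if not rows:
            return
        yield page, rows                        # caller writes each page as a raw artifact
        page += 1
```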
Deliverables (you must ship these)
Deliverable A — Extractor
- A script or service that pulls data (OData OR file ingestion)
- It writes raw artifacts to disk with run_id
Deliverable B — Normalizer
- A step that converts raw → clean output (JSONL or staging table)
Deliverable C — Run report (tiny but real)
- A generated markdown file or console summary:
- rows fetched
- rows normalized
- top errors
- runtime
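A minimal sketch of generating that summary as markdown:

```python
# Generate the tiny markdown run report (sketch).
def write_report(path: str, dataset: str, run_id: str, stats: dict,
                 errors: list[str], runtime_s: float) -> None:
    top_errors = "\n".join(f"- {e}" for e in errors[:5]) or "- none"
    report = (
        f"# Run report: {dataset} ({run_id})\n\n"
        f"- rows fetched: {stats.get('rows_in', 0)}\n"
        f"- rows normalized: {stats.get('rows_out', 0)}\n"
        f"- runtime: {runtime_s:.1f}s\n\n"
        f"## Top errors\n{top_errors}\n"
    )
    with open(path, "w") as f:
        f.write(report)
```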
Deliverable D — README updated
- how to run extraction
- where raw data goes
- how normalization works
Common traps (don’t do this)
- Trap 1: “I’ll only keep clean data.”
No. Raw is your safety net.
- Trap 2: “Deltas are always correct.”
No. Always add overlap + dedup.
- Trap 3: “Logging later.”
Logging later = debugging hell.
Quick self-check (2 minutes)
Answer yes/no:
- Do I store raw artifacts for every run?
- Can I rerun and reproduce the same output?
- Do I have a delta strategy + overlap + dedup?
- Can I explain the last run (rows/time/errors) quickly?
- Can I recover from a failure without manual chaos?
If any “no” — fix it before moving on.
Next module preview (W9–W10)
Next: Data Quality Rules & Mapping.
We’ll turn your W4–W5 rules into something that runs automatically after extraction.