Phase 2 · W7–W8
W7–W8: SAP Extraction Patterns (OData / files / exports)
Build one extraction flow that is predictable, observable, and resumable instead of fragile.
Suggested time: 4–6 hours/week
Outcomes
- One working extractor (pick one method: OData OR files).
- A raw data storage pattern (so you never lose the original truth).
- A normalization step (raw → clean tables).
- A delta strategy (even if simple at first).
- Logging and basic metrics (so you can debug fast).
- A “failure plan” (retries, resume, partial loads).
Deliverables
- A script or service that pulls data (OData OR file ingestion) and writes raw artifacts to disk with run_id.
- A step that converts raw → clean output (JSONL or staging table).
- A generated markdown file or console summary with rows fetched, rows normalized, top errors, and runtime.
- README updated with extraction run steps, raw storage path, and normalization behavior.
Prerequisites
- W6: Packaging, Testing Baseline, and “Definition of Done”
What you’re doing
You’re learning how to pull SAP data without turning your life into a support nightmare.
Extraction is where most “data projects” die:
- wrong deltas
- timeouts
- missing authorizations
- broken formats
- silent truncation
- “it worked in QA”
So we build extraction like an engineer: predictable, observable, and resumable.
Time: 4–6 hours/week
Output: one extraction pipeline that can fetch data (OData or file export), store raw snapshots, and produce a clean normalized output
The promise (what you’ll have by the end)
By the end of W8 you will have:
- One working extractor (pick one method: OData OR files)
- A raw data storage pattern (so you never lose the original truth)
- A normalization step (raw → clean tables)
- A delta strategy (even if simple at first)
- Logging and basic metrics (so you can debug fast)
- A “failure plan” (retries, resume, partial loads)
Pick ONE extraction method (don’t be greedy)
Choose based on your reality:
Option A — OData
Best when:
- you have stable services
- you can filter by date/changed-on
- you want structured JSON
Option B — Files / Exports
Best when:
- the org exports CSV/IDoc dumps already
- OData is slow or blocked
- you can get scheduled drops (SFTP / Share / email attachments — yes, real life)
Pick ONE now. You can add the second later.
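If you pick Option A, a delta pull is usually just a filtered GET against the service. A minimal sketch, assuming a hypothetical gateway URL, entity set, and ChangedOn field (all of those names, plus the credentials, will differ in your system):

```python
# Minimal OData v2 delta pull (sketch). BASE_URL, ENTITY and "ChangedOn" are
# hypothetical placeholders -- adjust them to your actual service.
from datetime import date, timedelta

import requests

BASE_URL = "https://sap-gateway.example.com/sap/opu/odata/sap/ZBP_SRV"  # hypothetical
ENTITY = "BusinessPartners"                                             # hypothetical

since = date.today() - timedelta(days=3)  # overlap window: re-fetch the last 3 days
params = {
    "$filter": f"ChangedOn ge datetime'{since.isoformat()}T00:00:00'",
    "$format": "json",
    "$top": "5000",        # page size; combine with $skip to paginate
}

resp = requests.get(
    f"{BASE_URL}/{ENTITY}",
    params=params,
    auth=("user", "pass"),  # replace with your real auth mechanism
    timeout=60,
)
resp.raise_for_status()
rows = resp.json()["d"]["results"]   # OData v2 JSON envelope
print(f"fetched {len(rows)} rows changed since {since}")
```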
Your extraction “golden rules”
- Always store raw data first.
- Never trust deltas blindly.
- Every run has a run_id and a timestamp.
- Failures must be restartable.
- Logs must tell you what happened.
Step-by-step checklist
1) Define the dataset
Write down:
- object (BP / addresses / partner functions / etc.)
- fields
- expected volume
- delta key (changed_on, or something else)
- file format (if file-based)
Keep it on one page.
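It helps to keep that one-pager in a machine-readable form next to the extractor. A rough example with made-up values:

```python
# One-page dataset definition kept next to the extractor (example values only).
DATASET = {
    "name": "business_partners",
    "object": "BP + addresses",
    "fields": ["PartnerId", "Name", "City", "Country", "ChangedOn"],
    "expected_volume": "~200k rows full, ~2k/day delta",
    "delta_key": "ChangedOn",
    "file_format": None,  # only relevant for file-based extraction, e.g. "csv"
}
```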
2) Build raw storage
Create a folder like:
- data/raw/<dataset>/<date>/<run_id>.json (or .csv)
Raw means:
- unmodified
- exactly what you got from SAP/export
This is your insurance.
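A minimal sketch of that write path, assuming JSON payloads and the folder layout above; the run_id and fetch timestamp are stored inside the artifact as well:

```python
# Write the untouched payload to data/raw/<dataset>/<date>/<run_id>.json (sketch).
import json
import uuid
from datetime import date, datetime, timezone
from pathlib import Path

def write_raw(dataset: str, payload: list[dict]) -> Path:
    run_id = uuid.uuid4().hex[:12]                      # one run_id per extraction run
    out_dir = Path("data/raw") / dataset / date.today().isoformat()
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{run_id}.json"
    path.write_text(json.dumps({
        "run_id": run_id,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "rows": payload,                                 # exactly what SAP/export returned
    }, ensure_ascii=False))
    return path
```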
3) Build normalization (raw → clean)
Normalize into:
- a clean JSONL file, or
- Postgres staging table
Goal:
- consistent columns
- types fixed
- missing values handled
- duplicates flagged (not hidden)
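A minimal normalization sketch, assuming the JSONL route and example field names like PartnerId and ChangedOn (swap in your own):

```python
# Raw -> clean JSONL sketch: fix types, handle missing values, flag duplicates.
import json
from pathlib import Path

def normalize(raw_path: Path, clean_path: Path, key_field: str = "PartnerId") -> dict:
    raw = json.loads(raw_path.read_text())
    seen, stats = set(), {"rows_in": 0, "rows_out": 0, "duplicates": 0}
    with clean_path.open("w") as out:
        for row in raw["rows"]:
            stats["rows_in"] += 1
            record = {
                "partner_id": str(row.get(key_field, "")).strip(),
                "name": (row.get("Name") or "").strip() or None,   # empty -> explicit null
                "changed_on": row.get("ChangedOn"),
                "is_duplicate": False,
            }
            if record["partner_id"] in seen:
                record["is_duplicate"] = True                      # flag, don't hide
                stats["duplicates"] += 1
            seen.add(record["partner_id"])
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            stats["rows_out"] += 1
    return stats
```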
4) Add delta strategy (simple but honest)
Start simple. Either:
- full load weekly + daily small incremental, or
- incremental by changed_on with a safety overlap window (e.g., last 3 days)
Important:
- overlap is normal
- dedup is mandatory
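A sketch of the overlap window plus dedup, assuming changed_on holds ISO-8601 timestamps so the newest version of each key wins:

```python
# Overlap window + dedup after an incremental load (sketch).
from datetime import date, timedelta

OVERLAP_DAYS = 3

def delta_since(last_run_date: date) -> date:
    """Re-fetch a few extra days so late or missed changes are picked up again."""
    return last_run_date - timedelta(days=OVERLAP_DAYS)

def dedup_latest(rows: list[dict], key: str = "partner_id", ts: str = "changed_on") -> list[dict]:
    latest: dict[str, dict] = {}
    for row in rows:
        prev = latest.get(row[key])
        # Later changed_on wins (string compare is chronological for ISO-8601).
        if prev is None or (row[ts] or "") > (prev[ts] or ""):
            latest[row[key]] = row
    return list(latest.values())
```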
5) Add observability
Minimum logs:
- dataset name
- run_id
- start/end time
- rows fetched
- rows normalized
- errors count
If you can’t answer “what happened in the last run?” you will suffer.
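One structured log line per run is enough to start. A minimal sketch:

```python
# Minimal structured run log (sketch): one line answers "what happened in the last run?".
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("extract")

def log_run(dataset: str, run_id: str, started: float, stats: dict, errors: int) -> None:
    log.info(json.dumps({
        "dataset": dataset,
        "run_id": run_id,
        "runtime_s": round(time.time() - started, 1),
        "rows_fetched": stats.get("rows_in", 0),
        "rows_normalized": stats.get("rows_out", 0),
        "errors": errors,
    }))
```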
6) Add failure handling
Minimum:
- retries for network
- resume from last successful page/file
- keep partial raw artifacts (don’t delete)
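A sketch of retries with backoff and page-level resume, assuming a paged OData endpoint driven by $top/$skip (for file-based ingestion, resume per file instead):

```python
# Retries with exponential backoff and page-level resume (sketch).
import time

import requests

def fetch_all(url: str, page_size: int = 5000, start_page: int = 0, max_retries: int = 3):
    """Resume a failed run by passing the last successful page as start_page."""
    page = start_page
    while True:
        params = {"$format": "json", "$top": page_size, "$skip": page * page_size}
        for attempt in range(max_retries):
            try:
                resp = requests.get(url, params=params, timeout=60)
                resp.raise_for_status()
                break
            except requests.RequestException:
                if attempt == max_retries - 1:
                    raise                       # give up; raw pages written so far stay on disk
                time.sleep(2 ** attempt)        # simple exponential backoff
        rows = resp.json()["d"]["results"]
        if not rows:
            return
        yield page, rows                        # caller writes each page as a raw artifact
        page += 1
```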
Deliverables (you must ship these)
Deliverable A — Extractor
- A script or service that pulls data (OData OR file ingestion)
- It writes raw artifacts to disk with run_id
Deliverable B — Normalizer
- A step that converts raw → clean output (JSONL or staging table)
Deliverable C — Run report (tiny but real)
- A generated markdown file or console summary:
- rows fetched
- rows normalized
- top errors
- runtime
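A minimal sketch of generating that summary as markdown:

```python
# Generate the tiny markdown run report (sketch).
def write_report(path: str, dataset: str, run_id: str, stats: dict,
                 errors: list[str], runtime_s: float) -> None:
    top_errors = "\n".join(f"- {e}" for e in errors[:5]) or "- none"
    report = (
        f"# Run report: {dataset} ({run_id})\n\n"
        f"- rows fetched: {stats.get('rows_in', 0)}\n"
        f"- rows normalized: {stats.get('rows_out', 0)}\n"
        f"- runtime: {runtime_s:.1f}s\n\n"
        f"## Top errors\n{top_errors}\n"
    )
    with open(path, "w") as f:
        f.write(report)
```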
Deliverable D — README updated
- how to run extraction
- where raw data goes
- how normalization works
Common traps (don’t do this)
- Trap 1: “I’ll only keep clean data.”
No. Raw is your safety net.
- Trap 2: “Deltas are always correct.”
No. Always add overlap + dedup.
- Trap 3: “Logging later.”
Logging later = debugging hell.
Quick self-check (2 minutes)
Answer yes/no:
- Do I store raw artifacts for every run?
- Can I rerun and reproduce the same output?
- Do I have a delta strategy + overlap + dedup?
- Can I explain the last run (rows/time/errors) quickly?
- Can I recover from a failure without manual chaos?
If any “no” — fix it before moving on.
Next module preview (W9–W10)
Next: Data Quality Rules & Mapping.
We’ll turn your W4–W5 rules into something that runs automatically after extraction.