Phase 2 · W7–W8

W7–W8: SAP Extraction Patterns (OData / files / exports)

Build one extraction flow that is predictable, observable, and resumable instead of fragile.

Suggested time: 4–6 hours/week

Outcomes

  • One working extractor (pick one method: OData OR files).
  • A raw data storage pattern (so you never lose the original truth).
  • A normalization step (raw → clean tables).
  • A delta strategy (even if simple at first).
  • Logging and basic metrics (so you can debug fast).
  • A “failure plan” (retries, resume, partial loads).

Deliverables

  • A script or service that pulls data (OData OR file ingestion) and writes raw artifacts to disk with run_id.
  • A step that converts raw → clean output (JSONL or staging table).
  • A generated markdown file or console summary with rows fetched, rows normalized, top errors, and runtime.
  • README updated with extraction run steps, raw storage path, and normalization behavior.

Prerequisites

  • W6: Packaging, Testing Baseline, and “Definition of Done”

W7–W8: SAP Extraction Patterns (OData / files / exports)

What you’re doing

You’re learning how to pull SAP data without turning your life into a support nightmare.

Extraction is where most “data projects” die:

  • wrong deltas
  • timeouts
  • missing authorizations
  • broken formats
  • silent truncation
  • “it worked in QA”

So we build extraction like an engineer: predictable, observable, and resumable.

Time: 4–6 hours/week
Output: one extraction pipeline that can fetch data (OData or file export), store raw snapshots, and produce a clean normalized output


The promise (what you’ll have by the end)

By the end of W8 you will have:

  • One working extractor (pick one method: OData OR files)
  • A raw data storage pattern (so you never lose the original truth)
  • A normalization step (raw → clean tables)
  • A delta strategy (even if simple at first)
  • Logging and basic metrics (so you can debug fast)
  • A “failure plan” (retries, resume, partial loads)

Pick ONE extraction method (don’t be greedy)

Choose based on your reality:

Option A — OData

Best when:

  • you have stable services
  • you can filter by date/changed-on
  • you want structured JSON

Option B — Files / Exports

Best when:

  • the org exports CSV/IDoc dumps already
  • OData is slow or blocked
  • you can get scheduled drops (SFTP / Share / email attachments — yes, real life)

Pick ONE now. You can add the second later.


Your extraction “golden rules”

  1. Always store raw data first.
  2. Never trust deltas blindly.
  3. Every run has a run_id and a timestamp.
  4. Failures must be restartable.
  5. Logs must tell you what happened.
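Rules 3 and 4 are easy to enforce mechanically. A minimal sketch of a run identifier that is both sortable and unique (the name `new_run_id` is illustrative, not part of any SAP API):

```python
import uuid
from datetime import datetime, timezone

def new_run_id() -> str:
    """Return a sortable run identifier: UTC timestamp plus a short random suffix."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{ts}-{uuid.uuid4().hex[:8]}"
```

Because the timestamp comes first, raw artifacts named by run_id sort chronologically on disk for free.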

Step-by-step checklist

1) Define the dataset

Write down:

  • object (BP / addresses / partner functions / etc.)
  • fields
  • expected volume
  • delta key (changed_on, or something else)
  • file format (if file-based)

Keep it on one page.
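That one page can literally be a small config next to the extractor. A hypothetical example, where every field name and value is illustrative rather than a fixed schema:

```python
# Hypothetical one-page dataset definition kept in the repo next to the extractor.
DATASET_SPEC = {
    "object": "business_partner",            # what you extract
    "fields": ["bp_id", "name", "changed_on"],
    "expected_volume": "~50k rows per full load",
    "delta_key": "changed_on",               # field driving incremental loads
    "file_format": "csv",                    # only relevant for file-based extraction
}
```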

2) Build raw storage

Create a folder like:

  • data/raw/<dataset>/<date>/<run_id>.json (or .csv)

Raw means:

  • unmodified
  • exactly what you got from SAP/export

This is your insurance.
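A minimal sketch of the raw-storage write, assuming JSON payloads; the function name and `base` default are placeholders:

```python
import json
from datetime import date
from pathlib import Path

def write_raw(dataset: str, run_id: str, payload, base: str = "data/raw") -> Path:
    """Write the unmodified payload to <base>/<dataset>/<date>/<run_id>.json."""
    target_dir = Path(base) / dataset / date.today().isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / f"{run_id}.json"
    # No transformation here: raw means exactly what SAP/the export gave you.
    path.write_text(json.dumps(payload, ensure_ascii=False))
    return path
```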

3) Build normalization (raw → clean)

Normalize into:

  • a clean JSONL file, or
  • Postgres staging table

Goal:

  • consistent columns
  • types fixed
  • missing values handled
  • duplicates flagged (not hidden)
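A sketch of those four goals in one pass, assuming business-partner-like records; the field names (`BP_ID`, `NAME`, `CHANGED_ON`) are made-up examples:

```python
def normalize(raw_records):
    """Raw dicts -> clean rows: consistent columns, fixed types,
    missing values handled, duplicates flagged (not hidden)."""
    seen = set()
    clean = []
    for rec in raw_records:
        row = {
            "bp_id": str(rec.get("BP_ID", "")).strip(),          # types fixed
            "name": (rec.get("NAME") or "").strip() or None,     # missing -> None
            "changed_on": rec.get("CHANGED_ON"),
        }
        row["is_duplicate"] = row["bp_id"] in seen               # flagged, kept
        seen.add(row["bp_id"])
        clean.append(row)
    return clean
```

Note that duplicates stay in the output with a flag; dropping them silently would hide a data-quality signal you want in W9–W10.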

4) Add delta strategy (simple but honest)

Start simple. Pick one:

  • full load weekly + a small daily incremental
  • incremental by changed_on with a safety overlap window (e.g., last 3 days)

Important:

  • overlap is normal
  • dedup is mandatory
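The overlap-plus-dedup pair can be sketched in a few lines; the helper names and the "last occurrence wins" policy are assumptions, not a prescribed design:

```python
from datetime import datetime, timedelta

def delta_window_start(last_success: datetime, overlap_days: int = 3) -> datetime:
    """Fetch changed_on >= this value: last successful run minus a safety overlap."""
    return last_success - timedelta(days=overlap_days)

def dedup(rows, key="bp_id"):
    """Keep the last occurrence per key, so re-fetched overlap rows replace old ones."""
    by_key = {r[key]: r for r in rows}
    return list(by_key.values())
```

The overlap deliberately re-fetches rows you already have; that redundancy is what makes the dedup step mandatory rather than optional.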

5) Add observability

Minimum logs:

  • dataset name
  • run_id
  • start/end time
  • rows fetched
  • rows normalized
  • errors count

If you can’t answer “what happened in the last run?” you will suffer.
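A minimal sketch that emits exactly those fields as one structured log line per run (the logger name and field names are illustrative):

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("extract")

def log_run(dataset, run_id, started_at, rows_fetched, rows_normalized, error_count):
    """Log one machine-readable summary line answering 'what happened in the last run?'."""
    summary = {
        "dataset": dataset,
        "run_id": run_id,
        "runtime_s": round(time.time() - started_at, 2),
        "rows_fetched": rows_fetched,
        "rows_normalized": rows_normalized,
        "errors": error_count,
    }
    log.info("run summary %s", json.dumps(summary))
    return summary
```

One JSON line per run is enough to grep "what happened in the last run?" without a dashboard.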

6) Add failure handling

Minimum:

  • retries for network
  • resume from last successful page/file
  • keep partial raw artifacts (don’t delete)
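The retry rule can be sketched with exponential backoff around whatever page/file fetcher you use; `fetch_page` here is a placeholder for your own function:

```python
import time

def fetch_with_retry(fetch_page, page, retries=3, backoff_s=1.0):
    """Call fetch_page(page), retrying transient network failures with backoff."""
    for attempt in range(retries):
        try:
            return fetch_page(page)
        except ConnectionError:
            if attempt == retries - 1:
                raise  # give up loudly; the run report should show this
            time.sleep(backoff_s * (2 ** attempt))
```

Resume then falls out naturally: persist the last page/file that succeeded, and on restart begin one past it instead of from zero.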

Deliverables (you must ship these)

Deliverable A — Extractor

  • A script or service that pulls data (OData OR file ingestion)
  • It writes raw artifacts to disk with run_id

Deliverable B — Normalizer

  • A step that converts raw → clean output (JSONL or staging table)

Deliverable C — Run report (tiny but real)

  • A generated markdown file or console summary:
  • rows fetched
  • rows normalized
  • top errors
  • runtime
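A sketch of the markdown variant, reusing a run-summary dict; the function name and output path are placeholders:

```python
from pathlib import Path

def write_report(summary: dict, path: str = "run_report.md") -> str:
    """Render the run summary as a tiny markdown report and write it to disk."""
    lines = [f"# Run report: {summary['run_id']}", ""]
    for key in ("rows_fetched", "rows_normalized", "top_errors", "runtime_s"):
        lines.append(f"- **{key}**: {summary.get(key)}")
    text = "\n".join(lines)
    Path(path).write_text(text)
    return text
```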

Deliverable D — README updated

  • how to run extraction
  • where raw data goes
  • how normalization works

Common traps (don’t do this)

  • Trap 1: “I’ll only keep clean data.”

No. Raw is your safety net.

  • Trap 2: “Deltas are always correct.”

No. Always add overlap + dedup.

  • Trap 3: “Logging later.”

No. Logging later = debugging hell.

Quick self-check (2 minutes)

Answer yes/no:

  • Do I store raw artifacts for every run?
  • Can I rerun and reproduce the same output?
  • Do I have a delta strategy + overlap + dedup?
  • Can I explain the last run (rows/time/errors) quickly?
  • Can I recover from a failure without manual chaos?

If any “no” — fix it before moving on.


Next module preview (W9–W10)

Next: Data Quality Rules & Mapping.
We’ll turn your W4–W5 rules into something that runs automatically after extraction.

Next module: W9–W10: Data Quality Rules & Mapping (DQ checks, business rules, error categories)