Phase 2 · W9–W10

W9–W10: Data Quality Rules & Mapping (DQ checks, business rules, error categories)

Turn data quality from vague complaint into an automated gate with explicit rules, mapping, and measurable output.

Suggested time: 4–6 hours/week

Outcomes

  • A DQ rules engine (even if simple) that produces structured errors.
  • A mapping layer (old → new values) that is transparent and testable.
  • An error taxonomy you can reuse in dashboards and ticket triage.
  • A DQ report you can send to anyone without embarrassment.
  • A clear separation between “data is wrong”, “mapping is missing”, “process/config is broken”, and “interface overwrote it”.

Deliverables

  • A DQ rules engine exists, and running the pipeline produces structured errors.
  • One mapping dataset exists and is applied in one place, with at least one mapping test.
  • A generated DQ report exists with counts and top issues.
  • The quality gate policy is documented, and the pipeline can fail fast on critical errors.

Prerequisites

  • W7–W8: SAP Extraction Patterns (OData / files / exports)

W9–W10: Data Quality Rules & Mapping (DQ checks, business rules, error categories)

What you’re doing

You’re turning “data quality” from a vague complaint into an automated gate.

Most SAP teams “do DQ” like this:

  • someone spots bad data
  • someone pings someone
  • someone fixes it in GUI
  • repeat forever

You’re building the opposite:

  • rules that run every time
  • errors that are structured
  • mapping that is explicit
  • output that is measurable

Time: 4–6 hours/week
Output: an automated DQ + mapping step that runs after extraction and produces a clean report (and optionally a “fix plan”)


The promise (what you’ll have by the end)

By the end of W10 you will have:

  • A DQ rules engine (even if simple) that produces structured errors
  • A mapping layer (old → new values) that is transparent and testable
  • An error taxonomy you can reuse in dashboards and ticket triage
  • A DQ report you can send to anyone without embarrassment
  • A clear separation between:
      • “data is wrong”
      • “mapping is missing”
      • “process/config is broken”
      • “interface overwrote it”

The rule: no silent changes

If data changes:

  • you log it
  • you explain it
  • you can reproduce it

No “it got overwritten somehow”.
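
A minimal sketch of what that could look like in a Python pipeline: every change is appended to a change log with enough context to explain and replay it. The field names and the JSONL file are assumptions, not a fixed schema.

```python
# Minimal change-log sketch (assumed schema, illustrative names).
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json


@dataclass
class ChangeLogEntry:
    object_key: str       # e.g. BP number
    field: str            # which field changed
    old_value: str
    new_value: str
    reason: str           # e.g. "mapping", "dq_autofix"
    rule_or_map_id: str   # which rule or mapping caused the change
    changed_at: str       # ISO timestamp, so the run can be reproduced


def log_change(entry: ChangeLogEntry, logfile: str = "changes.jsonl") -> None:
    """Append the change as one JSON line so it can be audited or replayed."""
    with open(logfile, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(entry)) + "\n")


# Example: the mapping step changes a sales org and logs it instead of
# silently overwriting the value.
log_change(ChangeLogEntry(
    object_key="BP0001234",
    field="sales_org",
    old_value="1000",
    new_value="2000",
    reason="mapping",
    rule_or_map_id="MAP_SALES_ORG_V1",
    changed_at=datetime.now(timezone.utc).isoformat(),
))
```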


Build your DQ pipeline (simple but real)

1) Inputs

Use the normalized output from W7–W8.
If you don’t have normalized output, stop and fix that first.

2) DQ checks (use your W4–W5 work)

Start with 10–20 rules that hit 80% of the pain.

Examples:

  • required field missing
  • invalid format (VAT, postal code)
  • invalid value (not in allowed set)
  • cross-field inconsistency
  • duplicate key
  • invalid partner function set (SP/BP/PY/SH incomplete)

Each rule must output:

  • rule_id
  • severity (error/warn)
  • category (from taxonomy)
  • message
  • fields
  • object_key (BP number etc.)
  • suggested action (optional)
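
A minimal sketch of that shape in Python, with one example rule (required field missing). The dataclass fields mirror the list above; the category value and field names are illustrative assumptions.

```python
# Structured DQ error plus one example rule (illustrative names).
from dataclasses import dataclass
from typing import Optional


@dataclass
class DQError:
    rule_id: str
    severity: str                  # "error" or "warn"
    category: str                  # from your error taxonomy
    message: str
    fields: list[str]
    object_key: str                # BP number etc.
    suggested_action: Optional[str] = None


def check_required_fields(record: dict, required: list[str]) -> list[DQError]:
    """Rule: required field missing. Returns structured errors, never prints."""
    errors = []
    for name in required:
        if not record.get(name):
            errors.append(DQError(
                rule_id="REQ_FIELD_MISSING",
                severity="error",
                category="data_is_wrong",
                message=f"Required field '{name}' is empty",
                fields=[name],
                object_key=record.get("bp_number", "UNKNOWN"),
                suggested_action="Fill the field in the source system",
            ))
    return errors


# Example usage
record = {"bp_number": "BP0001234", "name": "ACME GmbH", "vat_number": ""}
print(check_required_fields(record, ["name", "vat_number", "country"]))
```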

3) Mapping layer (make it explicit)

Create one mapping file/table:

  • old_value → new_value
  • scope (sales org, company code, country, etc.)
  • effective date (optional)

Then apply mapping in one place.
No “random mapping logic” scattered around code.
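
A minimal sketch, assuming the mapping lives in one CSV file with columns old_value, new_value, scope (the file name and columns are assumptions), and one function is the only place that applies it:

```python
# Apply value mapping in exactly one place (assumed CSV layout, illustrative names).
import csv


def load_mapping(path: str) -> dict[tuple[str, str], str]:
    """Load (scope, old_value) -> new_value from a CSV file."""
    mapping = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            mapping[(row["scope"], row["old_value"])] = row["new_value"]
    return mapping


def apply_mapping(records: list[dict], mapping: dict, field: str, scope: str):
    """The only place where old values become new values."""
    unmapped = []
    for rec in records:
        key = (scope, rec.get(field, ""))
        if key in mapping:
            rec[field] = mapping[key]      # log this change ("no silent changes")
        elif rec.get(field):
            unmapped.append(rec[field])    # report later as "mapping is missing"
    return records, sorted(set(unmapped))


# Example: map legacy sales orgs to the new ones
sales_org_map = load_mapping("mapping_sales_org.csv")
records, unmapped = apply_mapping(
    [{"bp_number": "BP0001234", "sales_org": "1000"}],
    sales_org_map, field="sales_org", scope="sales_org",
)
```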

4) Report generation

Produce a report that includes:

  • total records processed
  • total errors/warnings
  • top 10 rules by frequency
  • top 10 fields with issues
  • sample rows (anonymized)
  • recommended next actions

If the report is ugly, no one will use it. Keep it readable.
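
A minimal sketch of a markdown report built from structured errors; it assumes the DQError shape from the earlier sketch:

```python
# Markdown DQ report from structured errors (assumes the DQError sketch above).
from collections import Counter


def build_report(total_records: int, errors: list) -> str:
    by_rule = Counter(e.rule_id for e in errors)
    by_field = Counter(f for e in errors for f in e.fields)
    lines = [
        "# DQ Report",
        f"- Records processed: {total_records}",
        f"- Errors: {sum(1 for e in errors if e.severity == 'error')}",
        f"- Warnings: {sum(1 for e in errors if e.severity == 'warn')}",
        "",
        "## Top rules by frequency",
        *[f"- {rule}: {count}" for rule, count in by_rule.most_common(10)],
        "",
        "## Top fields with issues",
        *[f"- {fld}: {count}" for fld, count in by_field.most_common(10)],
    ]
    return "\n".join(lines)


# Example: write the report next to the pipeline output
# with open("dq_report.md", "w", encoding="utf-8") as f:
#     f.write(build_report(total_records=1500, errors=errors))
```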

5) Quality gate decision

Define a simple policy:

  • If ERROR count > 0 for critical rules → fail the pipeline
  • If only WARN → pipeline passes but report is generated

This is how you stop shipping bad data downstream.
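
A minimal sketch of that policy in code; the set of critical rule IDs is an assumption you define yourself:

```python
# Quality gate: fail fast on critical errors, pass (with report) on warnings.
import sys

CRITICAL_RULES = {"REQ_FIELD_MISSING", "DUPLICATE_KEY"}  # assumed example set


def quality_gate(errors: list) -> None:
    critical = [e for e in errors
                if e.severity == "error" and e.rule_id in CRITICAL_RULES]
    if critical:
        print(f"GATE FAILED: {len(critical)} critical error(s), see the DQ report")
        sys.exit(1)   # non-zero exit stops the pipeline / CI job
    print("Gate passed: warnings only (if any), report generated")
```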


Deliverables (you must ship these)

Deliverable A — DQ rules engine

  • Rules exist (code or YAML)
  • Running the pipeline produces structured errors

Deliverable B — Mapping layer

  • One mapping dataset exists (csv/json)
  • Mapping is applied in one place
  • There is at least one mapping test
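
A minimal pytest-style sketch of that test, assuming the apply_mapping function from the earlier sketch lives in a module called mapping.py (both names are assumptions):

```python
# Mapping test sketch (assumes apply_mapping from the earlier sketch in mapping.py).
from mapping import apply_mapping


def test_sales_org_mapping_applied():
    mapping = {("sales_org", "1000"): "2000"}
    records = [{"bp_number": "BP0001234", "sales_org": "1000"}]
    mapped, unmapped = apply_mapping(records, mapping, field="sales_org", scope="sales_org")
    assert mapped[0]["sales_org"] == "2000"
    assert unmapped == []


def test_unmapped_value_is_reported():
    mapped, unmapped = apply_mapping(
        [{"bp_number": "BP0009999", "sales_org": "9999"}], {},
        field="sales_org", scope="sales_org",
    )
    assert mapped[0]["sales_org"] == "9999"   # untouched: no silent change
    assert unmapped == ["9999"]
```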

Deliverable C — DQ report

  • Generated markdown/HTML summary exists
  • It’s readable and includes counts + top issues

Deliverable D — Quality gate policy

  • Documented pass/fail policy exists
  • Pipeline can “fail fast” on critical errors

Common traps (don’t do this)

  • Trap 1: “We’ll fix errors manually later.”

Later means never. Gate it now.

  • Trap 2: “Mapping logic everywhere.”

No. One mapping layer. One source of truth.

  • Trap 3: “Too many rules.”

Start with the rules that match real tickets.

Quick self-check (2 minutes)

Answer yes/no:

  • Do my rules produce structured errors (not just print statements)?
  • Is mapping explicit, testable, and centralized?
  • Can I tell the difference between data errors vs mapping vs config vs overwrite?
  • Does my report show counts and top issues?
  • Do I have a pass/fail policy that prevents bad downstream updates?

If any “no” — fix it before moving on.


Next module preview (W11–W12)

Next: Storage Layer (Postgres).
We’ll stop living in files and build a proper schema + migrations so the pipeline becomes a real system.

Next module: W11–W12: Storage Layer (Postgres schema, migrations)