Phase 2 · W9–W10

W9–W10: Data Quality Rules & Mapping (DQ checks, business rules, error categories)

Turn data quality from vague complaint into an automated gate with explicit rules, mapping, and measurable output.

Suggested time: 4–6 hours/week

Outcomes

A DQ rules engine (even if simple) that produces structured errors.
A mapping layer (old → new values) that is transparent and testable.
An error taxonomy you can reuse in dashboards and ticket triage.
A DQ report you can send to anyone without embarrassment.
A clear separation between “data is wrong”, “mapping is missing”, “process/config is broken”, and “interface overwrote it”.

Deliverables

DQ rules engine exists and running the pipeline produces structured errors.
One mapping dataset exists and is applied in one place with at least one mapping test.
Generated DQ report exists with counts and top issues.
Quality gate policy is documented and pipeline can fail fast on critical errors.

Prerequisites

W7–W8: SAP Extraction Patterns (OData / files / exports)


What you’re doing

You’re turning “data quality” from a vague complaint into an automated gate.

Most SAP teams “do DQ” like this:

someone spots bad data
someone pings someone
someone fixes it in the GUI
repeat forever

You’re building the opposite:

rules that run every time
errors that are structured
mapping that is explicit
output that is measurable

Time: 4–6 hours/week
Output: an automated DQ + mapping step that runs after extraction and produces a clean report (and optionally a “fix plan”)

The promise (what you’ll have by the end)

By the end of W10 you will have:

A DQ rules engine (even if simple) that produces structured errors
A mapping layer (old → new values) that is transparent and testable
An error taxonomy you can reuse in dashboards and ticket triage
A DQ report you can send to anyone without embarrassment
A clear separation between:
“data is wrong”
“mapping is missing”
“process/config is broken”
“interface overwrote it”
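One way to make that separation concrete is to pin the four buckets down as an enum, so rules, dashboards, and tickets all use the same names. This is a minimal sketch: the member names and the owner names in `TRIAGE_OWNER` are made up for illustration, so swap in whatever your team actually uses.

```python
from enum import Enum


class ErrorCategory(Enum):
    """The four buckets from the list above (member names are illustrative)."""
    DATA_WRONG = "data_wrong"                        # the source value itself is bad
    MAPPING_MISSING = "mapping_missing"              # no old -> new entry exists
    PROCESS_CONFIG_BROKEN = "process_config_broken"  # process or config is broken
    INTERFACE_OVERWRITE = "interface_overwrite"      # an interface overwrote the value


# Hypothetical owners per category, used for ticket triage.
TRIAGE_OWNER = {
    ErrorCategory.DATA_WRONG: "data-steward",
    ErrorCategory.MAPPING_MISSING: "mapping-owner",
    ErrorCategory.PROCESS_CONFIG_BROKEN: "functional-consultant",
    ErrorCategory.INTERFACE_OVERWRITE: "integration-team",
}


def triage_owner(category: ErrorCategory) -> str:
    """Return the (assumed) owner who should receive tickets of this category."""
    return TRIAGE_OWNER[category]
```

Because every error carries exactly one category, "who fixes this?" becomes a lookup instead of a discussion.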

The rule: no silent changes

If data changes:

you log it
you explain it
you can reproduce it

No “it got overwritten somehow”.
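
A rough sketch of what "no silent changes" can look like in code: every change goes through one function that records the old value, the new value, the reason, and the step that made it. `ChangeRecord`, `apply_change`, and the field names are assumptions for illustration, not part of any SAP API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChangeRecord:
    object_key: str   # which object changed (BP number etc.)
    field: str        # which field changed
    old_value: str
    new_value: str
    reason: str       # "you explain it"
    source: str       # which step or rule made the change, so it can be reproduced


def apply_change(record: dict, field: str, new_value: str,
                 reason: str, source: str, log: List[ChangeRecord]) -> dict:
    """Return an updated copy of the record and log the change.

    The input record is never mutated, and a no-op change is not logged.
    """
    old_value = record.get(field, "")
    if old_value == new_value:
        return record
    log.append(ChangeRecord(record["object_key"], field, old_value,
                            new_value, reason, source))
    return {**record, field: new_value}
```

If a value differs between input and output and there is no `ChangeRecord` for it, that is a bug in the pipeline, not a mystery.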

Build your DQ pipeline (simple but real)

1) Inputs

Use the normalized output from W7–W8.
If you don’t have normalized output, stop and fix that first.

2) DQ checks (use your W4–W5 work)

Start with 10–20 rules that cover 80% of the pain.

Examples:

required field missing
invalid format (VAT, postal code)
invalid value (not in allowed set)
cross-field inconsistency
duplicate key
invalid partner function set (SP/BP/PY/SH incomplete)

Each rule must output:

rule_id
severity (error/warn)
category (from taxonomy)
message
fields
object_key (BP number etc.)
suggested action (optional)
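
The output fields above map naturally onto a small record type. Here is a minimal sketch, assuming records are plain dicts keyed by a hypothetical `bp_number` field; the rule IDs (`DQ001`, `DQ002`), the category string, and the two rule factories are illustrative, not a prescribed design.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class DQError:
    rule_id: str
    severity: str                 # "error" or "warn"
    category: str                 # taken from your taxonomy
    message: str
    fields: Tuple[str, ...]
    object_key: str               # BP number etc.
    suggested_action: Optional[str] = None


Rule = Callable[[dict], List[DQError]]


def required_field(rule_id: str, field: str) -> Rule:
    """Build a rule that flags a missing or blank field."""
    def check(record: dict) -> List[DQError]:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            return [DQError(rule_id, "error", "data_wrong",
                            f"required field '{field}' is missing",
                            (field,), str(record.get("bp_number", "")),
                            f"maintain '{field}' in the source system")]
        return []
    return check


def allowed_values(rule_id: str, field: str, allowed: Set[str]) -> Rule:
    """Build a rule that flags values outside an allowed set.

    Blank values are skipped here; required_field covers them.
    """
    def check(record: dict) -> List[DQError]:
        value = record.get(field)
        if value in (None, "") or value in allowed:
            return []
        return [DQError(rule_id, "error", "data_wrong",
                        f"'{value}' is not an allowed value for '{field}'",
                        (field,), str(record.get("bp_number", "")))]
    return check


def run_rules(records: Iterable[dict], rules: List[Rule]) -> List[DQError]:
    """Run every rule against every record and collect structured errors."""
    errors: List[DQError] = []
    for record in records:
        for rule in rules:
            errors.extend(rule(record))
    return errors
```

Cross-record checks such as duplicate keys need the whole dataset rather than one record, so they would need a slightly different rule signature than this sketch uses.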

3) Mapping layer (make it explicit)

Create one mapping file/table:

old_value → new_value
scope (sales org, company code, country, etc.)
effective date (optional)

Then apply mapping in one place.
No “random mapping logic” scattered around code.
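
As a sketch of "one file, one place": the mapping lives in a CSV, and a single function does the lookup. The sample values, the column names, and the `*` wildcard for "any scope" are assumptions made for this example; the optional effective date is left out to keep it short.

```python
import csv
import io
from typing import Dict, Optional, Tuple

# Hypothetical mapping data; in practice this would be a CSV file in the repo.
MAPPING_CSV = """old_value,new_value,scope
Z001,NT30,1000
Z001,NT45,2000
Z002,NT60,*
"""

MappingTable = Dict[Tuple[str, str], str]


def load_mapping(text: str) -> MappingTable:
    """Parse the CSV into a {(scope, old_value): new_value} lookup."""
    table: MappingTable = {}
    for row in csv.DictReader(io.StringIO(text)):
        table[(row["scope"], row["old_value"])] = row["new_value"]
    return table


def map_value(table: MappingTable, old_value: str, scope: str) -> Optional[str]:
    """Look up the new value; a scope-specific entry wins over the wildcard.

    Returns None when no mapping exists, so the caller can raise a
    "mapping is missing" error instead of guessing.
    """
    if (scope, old_value) in table:
        return table[(scope, old_value)]
    return table.get(("*", old_value))
```

A mapping test can then be as small as:

```python
table = load_mapping(MAPPING_CSV)
assert map_value(table, "Z001", "2000") == "NT45"
assert map_value(table, "Z999", "1000") is None
```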

4) Report generation

Produce a report that includes:

total records processed
total errors/warnings
top 10 rules by frequency
top 10 fields with issues
sample rows (anonymized)
recommended next actions

If the report is ugly, no one will use it. Keep it readable.
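
The counting part of the report is a few lines with `collections.Counter`. The sketch below assumes each error is a dict with `rule_id`, `severity`, and `fields` keys and emits a small markdown summary; sample rows and recommended actions are left out, and the layout is only one possible choice.

```python
from collections import Counter
from typing import List


def build_report(total_records: int, errors: List[dict], top_n: int = 10) -> str:
    """Summarize structured errors as a short markdown report."""
    severities = Counter(e["severity"] for e in errors)
    by_rule = Counter(e["rule_id"] for e in errors)
    by_field = Counter(f for e in errors for f in e["fields"])

    lines = [
        "# DQ report",
        "",
        f"Records processed: {total_records}",
        f"Errors: {severities['error']}",      # Counter returns 0 for absent keys
        f"Warnings: {severities['warn']}",
        "",
        "## Top rules by frequency",
    ]
    lines += [f"- {rule_id}: {count}" for rule_id, count in by_rule.most_common(top_n)]
    lines += ["", "## Top fields with issues"]
    lines += [f"- {field}: {count}" for field, count in by_field.most_common(top_n)]
    return "\n".join(lines)
```

Write the returned string to a file next to the pipeline output so every run leaves a report behind.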

5) Quality gate decision

Define a simple policy:

If ERROR count > 0 for critical rules → fail the pipeline
If only WARN → the pipeline passes, and the report is still generated

This is how you stop shipping bad data downstream.
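
The policy fits in one function. In this sketch, errors are dicts with `rule_id` and `severity` keys, and `critical_rule_ids` is a set you maintain yourself; both are assumptions for illustration. Errors raised by non-critical rules do not fail the gate here, which is one reading of the policy above, so adjust it if your policy is stricter.

```python
from typing import Iterable, Set


def gate_decision(errors: Iterable[dict], critical_rule_ids: Set[str]) -> str:
    """Return "fail" if any critical rule raised an error, else "pass"."""
    critical_hits = sum(
        1 for e in errors
        if e["severity"] == "error" and e["rule_id"] in critical_rule_ids
    )
    return "fail" if critical_hits > 0 else "pass"


def exit_code(decision: str) -> int:
    """Map the decision to a process exit code (non-zero stops the pipeline)."""
    return 1 if decision == "fail" else 0
```

Calling `sys.exit(exit_code(decision))` as the last step, after the report has been written, lets a scheduler or CI job stop downstream steps on a failed gate.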

Deliverables (you must ship these)

Deliverable A — DQ rules engine

Rules exist (code or YAML)
Running the pipeline produces structured errors

Deliverable B — Mapping layer

One mapping dataset exists (csv/json)
Mapping is applied in one place
There is at least one mapping test

Deliverable C — DQ report

Generated markdown/HTML summary exists
It’s readable and includes counts + top issues

Deliverable D — Quality gate policy

Documented pass/fail policy exists
Pipeline can “fail fast” on critical errors

Common traps (don’t do this)

Trap 1: “We’ll fix errors manually later.”

Later means never. Gate it now.

Trap 2: “Mapping logic everywhere.”

No. One mapping layer. One source of truth.

Trap 3: “Too many rules.”

Start with the rules that match real tickets.

Quick self-check (2 minutes)

Answer yes/no:

Do my rules produce structured errors (not just print statements)?
Is mapping explicit, testable, and centralized?
Can I tell the difference between data errors vs mapping vs config vs overwrite?
Does my report show counts and top issues?
Do I have a pass/fail policy that prevents bad downstream updates?

If any “no” — fix it before moving on.

Next module preview (W11–W12)

Next: Storage Layer (Postgres).
We’ll stop living in files and build a proper schema + migrations so the pipeline becomes a real system.

Next module: W11–W12: Storage Layer (Postgres schema, migrations)