Phase 2 · W9–W10
W9–W10: Data Quality Rules & Mapping (DQ checks, business rules, error categories)
Turn data quality from vague complaint into an automated gate with explicit rules, mapping, and measurable output.
Suggested time: 4–6 hours/week
Outcomes
- A DQ rules engine (even if simple) that produces structured errors.
- A mapping layer (old → new values) that is transparent and testable.
- An error taxonomy you can reuse in dashboards and ticket triage.
- A DQ report you can send to anyone without embarrassment.
- A clear separation between “data is wrong”, “mapping is missing”, “process/config is broken”, and “interface overwrote it”.
Deliverables
- A DQ rules engine exists, and running the pipeline produces structured errors.
- One mapping dataset exists and is applied in one place, with at least one mapping test.
- A generated DQ report exists, with counts and top issues.
- The quality gate policy is documented, and the pipeline can fail fast on critical errors.
Prerequisites
- W7–W8: SAP Extraction Patterns (OData / files / exports)
What you’re doing
You’re turning “data quality” from a vague complaint into an automated gate.
Most SAP teams “do DQ” like this:
- someone spots bad data
- someone pings someone
- someone fixes it in the GUI
- repeat forever
You’re building the opposite:
- rules that run every time
- errors that are structured
- mapping that is explicit
- output that is measurable
Time: 4–6 hours/week
Output: an automated DQ + mapping step that runs after extraction and produces a clean report (and optionally a “fix plan”)
The promise (what you’ll have by the end)
By the end of W10 you will have:
- A DQ rules engine (even if simple) that produces structured errors
- A mapping layer (old → new values) that is transparent and testable
- An error taxonomy you can reuse in dashboards and ticket triage
- A DQ report you can send to anyone without embarrassment
- A clear separation between:
  - “data is wrong”
  - “mapping is missing”
  - “process/config is broken”
  - “interface overwrote it”
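One way to make the taxonomy reusable is to pin the four categories down as a closed set in code. This is a minimal sketch; the names `ErrorCategory` and `parse_category` and the string values are assumptions for illustration, not a fixed convention.

```python
from enum import Enum


class ErrorCategory(Enum):
    """Hypothetical category names; rename to match your own taxonomy."""
    DATA_WRONG = "data_wrong"                    # the source value itself is bad
    MAPPING_MISSING = "mapping_missing"          # no old -> new entry for this value
    CONFIG_BROKEN = "config_broken"              # process/config is broken
    INTERFACE_OVERWRITE = "interface_overwrite"  # an interface overwrote the value


def parse_category(raw: str) -> ErrorCategory:
    """Reject unknown category strings so new ones get added on purpose."""
    try:
        return ErrorCategory(raw)
    except ValueError:
        raise ValueError(f"unknown error category: {raw!r}") from None


print(parse_category("mapping_missing"))
```

A closed set keeps dashboards and ticket triage from drifting into free-text categories like “misc”.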
The rule: no silent changes
If data changes:
- you log it
- you explain it
- you can reproduce it
No “it got overwritten somehow”.
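A minimal sketch of what “log it, explain it, reproduce it” could look like: every write goes through one helper that records the old value, the new value, and the reason. `ChangeRecord`, `set_field`, and the sample field names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeRecord:
    object_key: str
    field: str
    old_value: Optional[str]
    new_value: str
    reason: str  # e.g. the mapping entry or rule that caused the change


def set_field(record: dict, field: str, new_value: str, reason: str,
              key_field: str, log: list) -> None:
    """Update record[field]; append a ChangeRecord only if the value changed."""
    old_value = record.get(field)
    if old_value == new_value:
        return
    log.append(ChangeRecord(record[key_field], field, old_value, new_value, reason))
    record[field] = new_value


log = []
bp = {"bp_number": "1000001", "country": "UK"}  # sample data, not real
set_field(bp, "country", "GB", "mapping: country UK -> GB", "bp_number", log)
print(json.dumps([asdict(c) for c in log], indent=2))
```

If the log is written next to the pipeline output, a rerun on the same input should produce the same log, which is the “reproduce it” part.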
Build your DQ pipeline (simple but real)
1) Inputs
Use the normalized output from W7–W8.
If you don’t have normalized output, stop and fix that first.
2) DQ checks (use your W4–W5 work)
Start with 10–20 rules that cover roughly 80% of the pain.
Examples:
- required field missing
- invalid format (VAT, postal code)
- invalid value (not in allowed set)
- cross-field inconsistency
- duplicate key
- incomplete partner function set (one of SP/BP/PY/SH, i.e. sold-to, bill-to, payer, ship-to, is missing)
Each rule must output:
- rule_id
- severity (error/warn)
- category (from taxonomy)
- message
- fields
- object_key (BP number etc.)
- suggested action (optional)
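The output fields above translate almost directly into a small record type. Here is a minimal sketch of a rules engine for record-level checks; `DQError`, `Rule`, `run_rules`, the rule IDs, and the sample records are all assumptions for illustration. Dataset-level rules such as duplicate keys need the whole input and are not covered here.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DQError:
    rule_id: str
    severity: str            # "error" or "warn"
    category: str            # from your taxonomy
    message: str
    fields: list
    object_key: str          # BP number etc.
    suggested_action: Optional[str] = None


@dataclass
class Rule:
    rule_id: str
    severity: str
    category: str
    fields: list
    message: str
    check: Callable[[dict], bool]  # returns True when the record passes
    suggested_action: Optional[str] = None


def run_rules(records, rules, key_field="bp_number"):
    """Apply every rule to every record and collect structured errors."""
    errors = []
    for rec in records:
        for rule in rules:
            if not rule.check(rec):
                errors.append(DQError(
                    rule.rule_id, rule.severity, rule.category, rule.message,
                    rule.fields, str(rec.get(key_field, "")), rule.suggested_action))
    return errors


RULES = [
    Rule("REQ_NAME", "error", "data_wrong", ["name"], "name is required",
         lambda r: bool((r.get("name") or "").strip())),
    Rule("FMT_POSTAL_DE", "warn", "data_wrong", ["country", "postal_code"],
         "German postal code must be 5 digits",
         lambda r: r.get("country") != "DE"
         or re.fullmatch(r"\d{5}", r.get("postal_code") or "") is not None),
]

records = [  # sample data, not real
    {"bp_number": "1000001", "name": "ACME GmbH", "country": "DE", "postal_code": "80331"},
    {"bp_number": "1000002", "name": "", "country": "DE", "postal_code": "8033"},
]
errors = run_rules(records, RULES)
for e in errors:
    print(e)
```

Keeping each rule as data plus one small check function makes it easy to move the rule list to YAML later without touching the engine.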
3) Mapping layer (make it explicit)
Create one mapping file/table:
- old_value → new_value
- scope (sales org, company code, country, etc.)
- effective date (optional)
Then apply mapping in one place.
No “random mapping logic” scattered around code.
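A minimal sketch of a single mapping table with one lookup function and one test. The CSV columns, the `*` wildcard scope, and the payment-terms values are hypothetical; effective dates are left out for brevity.

```python
import csv
import io

# Hypothetical mapping file content; in practice this would live in a CSV file.
MAPPING_CSV = """field,scope,old_value,new_value
payment_terms,1000,Z030,NT30
payment_terms,*,Z000,NT00
"""


def load_mapping(text):
    """Build a lookup keyed by (field, scope, old_value)."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        table[(row["field"], row["scope"], row["old_value"])] = row["new_value"]
    return table


def map_value(table, field, scope, old_value):
    """Return (value, found). Scope-specific entries win over the '*' wildcard."""
    for candidate_scope in (scope, "*"):
        key = (field, candidate_scope, old_value)
        if key in table:
            return table[key], True
    return old_value, False  # unmapped: caller should raise a 'mapping missing' error


def test_mapping():
    table = load_mapping(MAPPING_CSV)
    assert map_value(table, "payment_terms", "1000", "Z030") == ("NT30", True)
    assert map_value(table, "payment_terms", "2000", "Z000") == ("NT00", True)
    assert map_value(table, "payment_terms", "2000", "Z030") == ("Z030", False)


test_mapping()
```

Returning a `found` flag instead of silently passing the old value through is what lets you report “mapping is missing” separately from “data is wrong”.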
4) Report generation
Produce a report that includes:
- total records processed
- total errors/warnings
- top 10 rules by frequency
- top 10 fields with issues
- sample rows (anonymized)
- recommended next actions
If the report is ugly, no one will use it. Keep it readable.
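The counting parts of the report can be sketched with the standard library alone. This example assumes errors are dicts with `rule_id`, `severity`, and `fields` keys (a hypothetical shape) and emits markdown; sample rows and recommended actions are left out.

```python
from collections import Counter


def build_report(total_records, errors, top_n=10):
    """Render a short markdown summary from a list of structured errors."""
    by_severity = Counter(e["severity"] for e in errors)
    by_rule = Counter(e["rule_id"] for e in errors)
    by_field = Counter(f for e in errors for f in e["fields"])
    lines = [
        "# DQ report",
        "",
        f"- Records processed: {total_records}",
        f"- Errors: {by_severity.get('error', 0)}",
        f"- Warnings: {by_severity.get('warn', 0)}",
        "",
        f"## Top {top_n} rules by frequency",
    ]
    lines += [f"- {rule_id}: {n}" for rule_id, n in by_rule.most_common(top_n)]
    lines += ["", f"## Top {top_n} fields with issues"]
    lines += [f"- {name}: {n}" for name, n in by_field.most_common(top_n)]
    return "\n".join(lines) + "\n"


errors = [  # sample data, not real
    {"rule_id": "REQ_NAME", "severity": "error", "fields": ["name"]},
    {"rule_id": "FMT_POSTAL_DE", "severity": "warn", "fields": ["country", "postal_code"]},
    {"rule_id": "REQ_NAME", "severity": "error", "fields": ["name"]},
]
report = build_report(3, errors)
print(report)
```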
5) Quality gate decision
Define a simple policy:
- If ERROR count > 0 for critical rules → fail the pipeline
- If only WARN → pipeline passes but report is generated
This is how you stop shipping bad data downstream.
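The policy fits in a few lines. In this sketch, `CRITICAL_RULES` and the rule IDs are hypothetical, and errors on non-critical rules are assumed to pass the gate; tighten that if your policy says otherwise.

```python
CRITICAL_RULES = {"REQ_NAME", "DUP_KEY"}  # hypothetical rule IDs


def gate(errors, critical_rules=CRITICAL_RULES):
    """Return 1 (fail) if any critical rule raised an error, else 0 (pass)."""
    blocking = [
        e for e in errors
        if e["severity"] == "error" and e["rule_id"] in critical_rules
    ]
    return 1 if blocking else 0


only_warnings = [{"rule_id": "FMT_POSTAL_DE", "severity": "warn"}]
with_critical = only_warnings + [{"rule_id": "REQ_NAME", "severity": "error"}]

print(gate(only_warnings))   # pipeline passes, report is still generated
print(gate(with_critical))   # pipeline fails
```

Using the return value as the process exit code (for example via `sys.exit(gate(errors))` in the pipeline entry point) lets a scheduler or CI job stop downstream steps.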
Deliverables (you must ship these)
Deliverable A — DQ rules engine
- Rules exist (code or YAML)
- Running the pipeline produces structured errors
Deliverable B — Mapping layer
- One mapping dataset exists (csv/json)
- Mapping is applied in one place
- There is at least one mapping test
Deliverable C — DQ report
- Generated markdown/HTML summary exists
- It’s readable and includes counts + top issues
Deliverable D — Quality gate policy
- Documented pass/fail policy exists
- Pipeline can “fail fast” on critical errors
Common traps (don’t do this)
- Trap 1: “We’ll fix errors manually later.”
  Later means never. Gate it now.
- Trap 2: “Mapping logic everywhere.”
  No. One mapping layer. One source of truth.
- Trap 3: “Too many rules.”
  Start with the rules that match real tickets.
Quick self-check (2 minutes)
Answer yes/no:
- Do my rules produce structured errors (not just print statements)?
- Is mapping explicit, testable, and centralized?
- Can I tell the difference between data errors vs mapping vs config vs overwrite?
- Does my report show counts and top issues?
- Do I have a pass/fail policy that prevents bad downstream updates?
If any answer is “no”, fix it before moving on.
Next module preview (W11–W12)
Next: Storage Layer (Postgres).
We’ll stop living in files and build a proper schema + migrations so the pipeline becomes a real system.