Phase 2 · W11–W12
W11–W12: Storage Layer (Postgres schema, migrations)
Build a real storage spine with Postgres schema + migrations so loads and queries are reliable.
Suggested time: 4–6 hours/week
Outcomes
- A Postgres database you can spin up locally.
- A schema that separates raw, staging, and curated data.
- Migrations (so schema changes are controlled, not chaos).
- A load process (normalized → tables) that is idempotent.
- Basic queries that answer real AMS questions.
- A tiny “data dictionary” so people understand fields.
Deliverables
- Local Postgres setup with documented connection config.
- Schema + migrations that can be applied from scratch.
- Loader that supports idempotent normalized → staging loads.
- At least 5 saved and documented useful queries.
Prerequisites
- W9–W10: Data Quality Rules & Mapping (DQ checks, business rules, error categories)
What you’re doing
You stop treating your pipeline like “some scripts”.
You give it a real spine: a database with a schema you control.
Files are okay for demos.
But if you want “engineering”, you need:
- a schema
- versioned changes (migrations)
- repeatable queries
- history you can trust
Time: 4–6 hours/week
Output: a Postgres storage layer with a clear schema, migrations, and a reliable way to load and query data
The promise (what you’ll have by the end)
By the end of W12 you will have:
- A Postgres database you can spin up locally
- A schema that separates raw, staging, and curated data
- Migrations (so schema changes are controlled, not chaos)
- A load process (normalized → tables) that is idempotent
- Basic queries that answer real AMS questions
- A tiny “data dictionary” so people understand fields
The rule: design for change
Your schema will change. Always.
So you don’t “edit tables manually”.
You use migrations.
Migrations = adult supervision for your database.
Minimal schema (don’t overbuild)
Use a 3-layer model:
1) raw
- raw_payload (json/text)
- source
- extracted_at
- run_id
2) staging
- normalized columns (typed)
- load_at
- run_id
- record_hash (for dedup)
3) curated
- your “best version of truth”
- keys + important fields
- last_seen_at
- quality_status (optional)
This is enough for 80% of use cases.
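A minimal SQL sketch of the three layers, using a business-partner slice as the example dataset (table names and any columns beyond the ones listed above are illustrative, not prescriptive):

```sql
-- Raw layer: payloads exactly as extracted, plus lineage columns.
CREATE TABLE raw_records (
    id            BIGSERIAL PRIMARY KEY,
    raw_payload   JSONB        NOT NULL,
    source        TEXT         NOT NULL,
    extracted_at  TIMESTAMPTZ  NOT NULL,
    run_id        TEXT         NOT NULL
);

-- Staging layer: typed, normalized columns plus a hash for dedup.
CREATE TABLE stg_business_partners (
    bp_number    TEXT         PRIMARY KEY,
    name         TEXT,
    country      TEXT,
    load_at      TIMESTAMPTZ  NOT NULL DEFAULT now(),
    run_id       TEXT         NOT NULL,
    record_hash  TEXT         NOT NULL
);

-- Curated layer: best version of truth, keyed by the natural key.
-- quality_status (optional) is added later via a migration; see step 3.
CREATE TABLE cur_business_partners (
    bp_number     TEXT         PRIMARY KEY,
    name          TEXT,
    country       TEXT,
    last_seen_at  TIMESTAMPTZ  NOT NULL
);
```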
Step-by-step checklist
1) Run Postgres locally
Use docker-compose if you already have it, or any simple local setup.
Goal: one command to start the DB.
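Whatever setup you pick, use the same smoke test every time the database comes up (assumes psql or any SQL client pointed at your local instance):

```sql
-- Smoke test: confirms the server is reachable and which version is running.
SELECT version();
SELECT current_database(), current_user;
```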
2) Define tables + keys
Pick your dataset (BP slice / addresses / partner functions).
Define:
- primary key strategy
- natural key vs surrogate key
- unique constraints
- indexes (only the obvious ones)
Do not try to model the whole SAP data model. You’re not SAP.
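One common pattern, sketched below: a surrogate primary key for joins, a unique constraint on the natural key, and an index only on a column you actually filter by (the address table and its columns are illustrative):

```sql
-- Surrogate key for joins; the natural key stays enforced as unique.
CREATE TABLE stg_bp_addresses (
    id          BIGSERIAL PRIMARY KEY,   -- surrogate key
    bp_number   TEXT NOT NULL,           -- natural key, part 1
    address_id  TEXT NOT NULL,           -- natural key, part 2
    city        TEXT,
    country     TEXT,
    run_id      TEXT NOT NULL,
    UNIQUE (bp_number, address_id)       -- natural key uniqueness
);

-- Only the obvious index: you filter loads and checks by run_id.
CREATE INDEX idx_stg_bp_addresses_run_id ON stg_bp_addresses (run_id);
```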
3) Add migrations
Use whatever your repo already uses (for a Python-heavy repo, Alembic is a common choice).
If nothing exists, create a minimal `migrations/` folder and keep plain SQL migration files in it.
No magic needed.
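If you go the plain-SQL route, a minimal sketch is one numbered file per change plus a small bookkeeping table, so "what has been applied?" has exactly one answer (file names and the tracking table below are just one possible convention):

```sql
-- migrations/001_init.sql holds the CREATE TABLE statements from the schema above.
-- migrations/002_add_quality_status.sql, an example follow-up change:
ALTER TABLE cur_business_partners
    ADD COLUMN IF NOT EXISTS quality_status TEXT;

-- Tiny tracking table: records which migrations have already run.
CREATE TABLE IF NOT EXISTS schema_migrations (
    version     TEXT PRIMARY KEY,
    applied_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

INSERT INTO schema_migrations (version)
VALUES ('002_add_quality_status')
ON CONFLICT (version) DO NOTHING;
```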
4) Build idempotent loading
If you rerun the same load:
- you don’t duplicate records
- you update correctly
- you can detect changes
Common patterns:
- upsert by key
- record_hash comparison
- soft history table (optional)
Start simple:
- upsert into staging
- rebuild curated from staging (for now)
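A minimal sketch of that, assuming the staging and curated tables above and that your normalized rows land first in a table called incoming (that name is illustrative):

```sql
-- Upsert into staging: rerunning the same load neither duplicates rows
-- nor rewrites rows whose content has not actually changed.
INSERT INTO stg_business_partners (bp_number, name, country, run_id, record_hash)
SELECT bp_number, name, country, run_id, record_hash
FROM incoming
ON CONFLICT (bp_number) DO UPDATE
SET name        = EXCLUDED.name,
    country     = EXCLUDED.country,
    run_id      = EXCLUDED.run_id,
    record_hash = EXCLUDED.record_hash,
    load_at     = now()
WHERE stg_business_partners.record_hash IS DISTINCT FROM EXCLUDED.record_hash;

-- Rebuild curated from staging; good enough until you need real history.
TRUNCATE cur_business_partners;
INSERT INTO cur_business_partners (bp_number, name, country, last_seen_at)
SELECT bp_number, name, country, load_at
FROM stg_business_partners;
```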
5) Add 5 “AMS queries”
Write queries that answer real questions:
- how many records changed since last run?
- top DQ errors by rule_id (join from your DQ output)
- duplicates by key
- missing required fields count
- records with mapping applied (old→new) count
Put them in:
- a `queries/` folder
- or a README section
So they’re reusable.
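Two of the five as sketches, assuming the staging table above plus a DQ output table named dq_results with rule_id and record_key columns (that table comes from your W9–W10 work; the names here are illustrative):

```sql
-- Records touched per run: with the hash-guarded upsert, only changed rows
-- carry a new run_id, so this doubles as "what changed since last run?".
SELECT run_id, count(*) AS records_touched
FROM stg_business_partners
GROUP BY run_id
ORDER BY run_id DESC;

-- Top DQ errors by rule_id, joined back to staging records.
SELECT d.rule_id, count(*) AS error_count
FROM dq_results d
JOIN stg_business_partners s ON s.bp_number = d.record_key
GROUP BY d.rule_id
ORDER BY error_count DESC
LIMIT 10;
```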
6) Add a tiny data dictionary
One markdown file:
- table list
- columns explained (short)
- key fields highlighted
This makes your repo look professional instantly.
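A few example entries, assuming the illustrative tables above, show how small this can stay:
- `stg_business_partners.bp_number`: natural key of the business partner (key field)
- `stg_business_partners.record_hash`: hash of the normalized content, used for change detection
- `cur_business_partners.last_seen_at`: when a load last confirmed the record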
Deliverables (you must ship these)
Deliverable A — DB setup
- Local Postgres runs
- Connection config documented
Deliverable B — Schema + migrations
- Tables exist via migrations
- Schema versioning works
Deliverable C — Loader
- Normalized data loads into staging
- Idempotent behavior (rerun doesn’t duplicate)
Deliverable D — Useful queries
- At least 5 queries saved and documented
Common traps (don’t do this)
- Trap 1: “I’ll design a perfect schema like SAP.”
  No. Minimal schema that supports your pipeline first.
- Trap 2: “I’ll load everything into one giant table.”
  No. Raw/staging/curated separation saves your future self.
- Trap 3: “I’ll skip migrations.”
  Skipping migrations = guaranteed future pain.
Quick self-check (2 minutes)
Answer yes/no:
- Can I start Postgres in one command?
- Can I apply migrations from scratch reliably?
- Can I load the same dataset twice without duplicates?
- Can I query changes and DQ issues easily?
- Do I have a data dictionary that explains my schema?
If any “no” — fix it before moving on.
Next module preview (W13–W14)
Next: Automation & Scheduling.
Jobs, retries, idempotency, and how to run the pipeline like a real service.