Phase 2 · W11–W12
W11–W12: Storage Layer (Postgres schema, migrations)
Build a real storage spine with Postgres schema + migrations so loads and queries are reliable.
Suggested time: 4–6 hours/week
Outcomes
- A Postgres database you can spin up locally.
- A schema that separates raw, staging, and curated data.
- Migrations (so schema changes are controlled, not chaos).
- A load process (normalized → tables) that is idempotent.
- Basic queries that answer real AMS questions.
- A tiny “data dictionary” so people understand fields.
Deliverables
- Local Postgres setup with documented connection config.
- Schema + migrations that can be applied from scratch.
- Loader that supports idempotent normalized → staging loads.
- At least 5 saved and documented useful queries.
Prerequisites
- W9–W10: Data Quality Rules & Mapping (DQ checks, business rules, error categories)
What you’re doing
You stop treating your pipeline like “some scripts”.
You give it a real spine: a database with a schema you control.
Files are okay for demos.
But if you want “engineering”, you need:
- a schema
- versioned changes (migrations)
- repeatable queries
- history you can trust
Time: 4–6 hours/week
Output: a Postgres storage layer with a clear schema, migrations, and a reliable way to load and query data
The promise (what you’ll have by the end)
By the end of W12 you will have:
- A Postgres database you can spin up locally
- A schema that separates raw, staging, and curated data
- Migrations (so schema changes are controlled, not chaos)
- A load process (normalized → tables) that is idempotent
- Basic queries that answer real AMS questions
- A tiny “data dictionary” so people understand fields
The rule: design for change
Your schema will change. Always.
So you don’t “edit tables manually”.
You use migrations.
Migrations = adult supervision for your database.
Minimal schema (don’t overbuild)
Use a 3-layer model:
1) raw
- raw_payload (json/text)
- source
- extracted_at
- run_id
2) staging
- normalized columns (typed)
- load_at
- run_id
- record_hash (for dedup)
3) curated
- your “best version of truth”
- keys + important fields
- last_seen_at
- quality_status (optional)
This is enough for 80% of use cases.
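A minimal SQL sketch of the three layers, using a business-partner slice as the example dataset (table names and any columns beyond the ones listed above are illustrative, not prescriptive):

```sql
-- Raw layer: payloads exactly as extracted, plus lineage columns.
CREATE TABLE raw_records (
    id            BIGSERIAL PRIMARY KEY,
    raw_payload   JSONB        NOT NULL,
    source        TEXT         NOT NULL,
    extracted_at  TIMESTAMPTZ  NOT NULL,
    run_id        TEXT         NOT NULL
);

-- Staging layer: typed, normalized columns plus a hash for dedup.
CREATE TABLE stg_business_partners (
    bp_number    TEXT         PRIMARY KEY,
    name         TEXT,
    country      TEXT,
    load_at      TIMESTAMPTZ  NOT NULL DEFAULT now(),
    run_id       TEXT         NOT NULL,
    record_hash  TEXT         NOT NULL
);

-- Curated layer: best version of truth, keyed by the natural key.
-- quality_status (optional) is added later via a migration; see step 3.
CREATE TABLE cur_business_partners (
    bp_number     TEXT         PRIMARY KEY,
    name          TEXT,
    country       TEXT,
    last_seen_at  TIMESTAMPTZ  NOT NULL
);
```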
Step-by-step checklist
1) Run Postgres locally
Use docker-compose if you already have it, or any simple local setup.
Goal: one command to start the DB.
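Whatever setup you pick, use the same smoke test every time the database comes up (assumes psql or any SQL client pointed at your local instance):

```sql
-- Smoke test: confirms the server is reachable and which version is running.
SELECT version();
SELECT current_database(), current_user;
```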
2) Define tables + keys
Pick your dataset (BP slice / addresses / partner functions).
Define:
- primary key strategy
- natural key vs surrogate key
- unique constraints
- indexes (only the obvious ones)
Do not try to model the whole SAP data model. You’re not SAP.
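One common pattern, sketched below: a surrogate primary key for joins, a unique constraint on the natural key, and an index only on a column you actually filter by (the address table and its columns are illustrative):

```sql
-- Surrogate key for joins; the natural key stays enforced as unique.
CREATE TABLE stg_bp_addresses (
    id          BIGSERIAL PRIMARY KEY,   -- surrogate key
    bp_number   TEXT NOT NULL,           -- natural key, part 1
    address_id  TEXT NOT NULL,           -- natural key, part 2
    city        TEXT,
    country     TEXT,
    run_id      TEXT NOT NULL,
    UNIQUE (bp_number, address_id)       -- natural key uniqueness
);

-- Only the obvious index: you filter loads and checks by run_id.
CREATE INDEX idx_stg_bp_addresses_run_id ON stg_bp_addresses (run_id);
```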
3) Add migrations
Use whatever your repo already uses (for a Python-heavy repo, Alembic is a common choice).
If nothing exists, create a minimal `migrations/` folder and keep plain SQL migration files in it.
No magic needed.
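If you go the plain-SQL route, a minimal sketch is one numbered file per change plus a small bookkeeping table, so "what has been applied?" has exactly one answer (file names and the tracking table below are just one possible convention):

```sql
-- migrations/001_init.sql holds the CREATE TABLE statements from the schema above.
-- migrations/002_add_quality_status.sql, an example follow-up change:
ALTER TABLE cur_business_partners
    ADD COLUMN IF NOT EXISTS quality_status TEXT;

-- Tiny tracking table: records which migrations have already run.
CREATE TABLE IF NOT EXISTS schema_migrations (
    version     TEXT PRIMARY KEY,
    applied_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

INSERT INTO schema_migrations (version)
VALUES ('002_add_quality_status')
ON CONFLICT (version) DO NOTHING;
```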
4) Build idempotent loading
If you rerun the same load:
- you don’t duplicate records
- you update correctly
- you can detect changes
Common patterns:
- upsert by key
- record_hash comparison
- soft history table (optional)
Start simple:
- upsert into staging
- rebuild curated from staging (for now)
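A minimal sketch of that, assuming the staging and curated tables above and that your normalized rows land first in a table called incoming (that name is illustrative):

```sql
-- Upsert into staging: rerunning the same load neither duplicates rows
-- nor rewrites rows whose content has not actually changed.
INSERT INTO stg_business_partners (bp_number, name, country, run_id, record_hash)
SELECT bp_number, name, country, run_id, record_hash
FROM incoming
ON CONFLICT (bp_number) DO UPDATE
SET name        = EXCLUDED.name,
    country     = EXCLUDED.country,
    run_id      = EXCLUDED.run_id,
    record_hash = EXCLUDED.record_hash,
    load_at     = now()
WHERE stg_business_partners.record_hash IS DISTINCT FROM EXCLUDED.record_hash;

-- Rebuild curated from staging; good enough until you need real history.
TRUNCATE cur_business_partners;
INSERT INTO cur_business_partners (bp_number, name, country, last_seen_at)
SELECT bp_number, name, country, load_at
FROM stg_business_partners;
```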
5) Add 5 “AMS queries”
Write queries that answer real questions:
- how many records changed since last run?
- top DQ errors by rule_id (join from your DQ output)
- duplicates by key
- missing required fields count
- records with mapping applied (old→new) count
Put them in:
- a `queries/` folder
- or a README section
So they’re reusable.
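Two of the five as sketches, assuming the staging table above plus a DQ output table named dq_results with rule_id and record_key columns (that table comes from your W9–W10 work; the names here are illustrative):

```sql
-- Records touched per run: with the hash-guarded upsert, only changed rows
-- carry a new run_id, so this doubles as "what changed since last run?".
SELECT run_id, count(*) AS records_touched
FROM stg_business_partners
GROUP BY run_id
ORDER BY run_id DESC;

-- Top DQ errors by rule_id, joined back to staging records.
SELECT d.rule_id, count(*) AS error_count
FROM dq_results d
JOIN stg_business_partners s ON s.bp_number = d.record_key
GROUP BY d.rule_id
ORDER BY error_count DESC
LIMIT 10;
```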
6) Add a tiny data dictionary
One markdown file:
- table list
- columns explained (short)
- key fields highlighted
This makes your repo look professional instantly.
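A few example entries, assuming the illustrative tables above, show how small this can stay:
- `stg_business_partners.bp_number`: natural key of the business partner (key field)
- `stg_business_partners.record_hash`: hash of the normalized content, used for change detection
- `cur_business_partners.last_seen_at`: when a load last confirmed the record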
Deliverables (you must ship these)
Deliverable A — DB setup
- Local Postgres runs
- Connection config documented
Deliverable B — Schema + migrations
- Tables exist via migrations
- Schema versioning works
Deliverable C — Loader
- Normalized data loads into staging
- Idempotent behavior (rerun doesn’t duplicate)
Deliverable D — Useful queries
- At least 5 queries saved and documented
Common traps (don’t do this)
- Trap 1: “I’ll design a perfect schema like SAP.”
  No. Minimal schema that supports your pipeline first.
- Trap 2: “I’ll load everything into one giant table.”
  No. Raw/staging/curated separation saves your future self.
- Trap 3: “I’ll skip migrations.”
  Skipping migrations = guaranteed future pain.
Quick self-check (2 minutes)
Answer yes/no:
- Can I start Postgres in one command?
- Can I apply migrations from scratch reliably?
- Can I load the same dataset twice without duplicates?
- Can I query changes and DQ issues easily?
- Do I have a data dictionary that explains my schema?
If any “no” — fix it before moving on.
Next module preview (W13–W14)
Next: Automation & Scheduling.
Jobs, retries, idempotency, and how to run the pipeline like a real service.