Phase 4 · W29–W30
W29–W30: Knowledge Base Design (sources, chunking, metadata)
Design a trustworthy, structured knowledge base with chunking and metadata that supports source-grounded answers.
Suggested time: 4–6 hours/week
Outcomes
- A list of approved sources (what’s allowed in the knowledge base).
- A document structure standard (so content is consistent).
- A chunking strategy that doesn’t ruin meaning.
- A metadata schema that makes retrieval smart.
- An ingestion pipeline that produces “chunks” you can index later.
Deliverables
- Source rules doc with allowed and forbidden source types.
- Standard markdown template with at least 3 docs using it.
- Documented chunking strategy plus metadata schema.
- Generated chunks.jsonl where each record has text + metadata.
Prerequisites
- W27–W28: Operational Metrics & Reporting
W29–W30: Knowledge Base Design (sources, chunking, metadata)
What you’re doing
You stop having “docs scattered everywhere”.
You build a system that can answer questions with sources.
This is where RAG becomes real:
- not magic chat
- but a knowledge system with governance
Time: 4–6 hours/week
Output: a curated source set + chunking strategy + metadata schema + an ingestion pipeline that turns docs into searchable chunks
The promise (what you’ll have by the end)
By the end of W30 you will have:
- A list of approved sources (what’s allowed in the knowledge base)
- A document structure standard (so content is consistent)
- A chunking strategy that doesn’t ruin meaning
- A metadata schema that makes retrieval smart
- An ingestion pipeline that produces “chunks” you can index later
The rule: garbage in = garbage answers
If you feed messy docs, you get messy answers.
So we treat the knowledge base like a data engineering problem:
- clear sources
- clear structure
- clean metadata
- repeatable ingestion
Step 1: Define your sources (what goes in)
Pick 3–6 source types max:
- runbooks (how-to)
- RCA summaries (why it happened)
- mapping rules (old→new)
- interface specs (what data moves where)
- “known issues” list
- cutover checklists (optional)
Do not include:
- private personal data
- random chat logs
- unverified notes
If you can’t trust it, don’t ingest it.
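To make these rules enforceable rather than aspirational, the allowlist can live in a small config that the ingestion step checks. A minimal TypeScript sketch, assuming the source types above; the names (`SourceRule`, `isApprovedSource`) are illustrative, not a standard:

```typescript
// source-rules.ts — a sketch of machine-checkable source rules.
// Type names and the specific source types are illustrative assumptions.

export type SourceType =
  | "runbook"
  | "rca_summary"
  | "mapping_rules"
  | "interface_spec"
  | "known_issues"
  | "cutover_checklist";

export interface SourceRule {
  type: SourceType;
  description: string;
  requiresOwner: boolean; // every approved source needs a named owner
}

export const APPROVED_SOURCES: SourceRule[] = [
  { type: "runbook", description: "How-to procedures", requiresOwner: true },
  { type: "rca_summary", description: "Why an incident happened", requiresOwner: true },
  { type: "mapping_rules", description: "Old-to-new field and value mappings", requiresOwner: true },
  { type: "interface_spec", description: "What data moves where", requiresOwner: true },
  { type: "known_issues", description: "Curated known-issues list", requiresOwner: true },
];

// Anything not on the allowlist (chat logs, unverified notes, personal data)
// gets rejected at ingestion time instead of quietly slipping in.
export function isApprovedSource(type: string): type is SourceType {
  return APPROVED_SOURCES.some((rule) => rule.type === type);
}
```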
Step 2: Create a doc standard (simple template)
Every doc should have:
- title
- system/context (AS4/PS4/MDG/S4)
- tags
- owner
- updated_at
- “When to use” section
- “Steps” section
- “Common failures” section
- references/links
Consistency beats beauty.
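If you want the standard to be checkable and not just documented, the same template can be written down as a type plus a required-sections list. A minimal sketch, assuming each markdown doc carries YAML frontmatter with the fields above; `missingSections` is a hypothetical lint helper, not an existing tool:

```typescript
// doc-standard.ts — the Step 2 template expressed as a checkable structure.
// Field names mirror the bullets above; the system values are examples.

export interface DocFrontmatter {
  title: string;
  system: "AS4" | "PS4" | "MDG" | "S4";
  tags: string[];
  owner: string;
  updated_at: string; // ISO date, e.g. "2025-01-15"
  sensitivity?: "public" | "internal";
}

// Every doc must contain these markdown sections, in this order.
export const REQUIRED_SECTIONS = ["When to use", "Steps", "Common failures", "References"] as const;

// Tiny lint helper: which required "## " headings are missing from a doc body?
export function missingSections(markdown: string): string[] {
  return REQUIRED_SECTIONS.filter((section) => !markdown.includes(`## ${section}`));
}
```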
Step 3: Design chunking (don’t destroy context)
Chunking goal:
- small enough to retrieve
- big enough to make sense
Simple rules:
- chunk by headings/sections
- keep tables/code blocks intact
- include doc title + section title in every chunk
- avoid splitting in the middle of a procedure
Start with 300–800 tokens per chunk (roughly).
Adjust later.
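A minimal sketch of heading-based chunking, assuming markdown input split on `## ` headings and a rough words-to-tokens estimate; the `Chunk` shape and the 0.75 heuristic are assumptions, not a fixed recipe:

```typescript
// chunker.ts — heading-based chunking sketch. Parsing is naive on purpose:
// it splits on "## " headings and keeps each section whole unless it is far too large.
// Note: it does not yet special-case fenced code blocks or tables; add that before
// trusting it on real runbooks.

export interface Chunk {
  text: string;
  docTitle: string;
  sectionTitle: string;
}

// Very rough token estimate (~0.75 words per token for English prose).
const approxTokens = (text: string): number =>
  Math.ceil(text.split(/\s+/).filter(Boolean).length / 0.75);

export function chunkByHeadings(docTitle: string, markdown: string, maxTokens = 800): Chunk[] {
  const chunks: Chunk[] = [];
  // Split into "## Section" blocks; the first part is any preamble before the first heading.
  const parts = markdown.split(/^## /m);

  for (const part of parts) {
    if (!part.trim()) continue;
    const [firstLine, ...rest] = part.split("\n");
    const sectionTitle = firstLine.replace(/^#+\s*/, "").trim() || "Introduction";
    const body = rest.join("\n").trim();
    if (!body) continue;

    // Prefix every chunk with doc + section title so it still makes sense in isolation.
    const prefix = `${docTitle} :: ${sectionTitle}\n\n`;

    if (approxTokens(body) <= maxTokens) {
      chunks.push({ text: prefix + body, docTitle, sectionTitle });
    } else {
      // Oversized section: split on blank lines (paragraph boundaries), never mid-paragraph.
      let buffer = "";
      for (const para of body.split(/\n\s*\n/)) {
        if (buffer && approxTokens(buffer + para) > maxTokens) {
          chunks.push({ text: prefix + buffer.trim(), docTitle, sectionTitle });
          buffer = "";
        }
        buffer += para + "\n\n";
      }
      if (buffer.trim()) {
        chunks.push({ text: prefix + buffer.trim(), docTitle, sectionTitle });
      }
    }
  }
  return chunks;
}
```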
Step 4: Metadata schema (this is the secret sauce)
Metadata is what keeps retrieval from being dumb.
Each chunk should store:
- doc_id
- doc_title
- section_title
- system (MDG/S4/etc.)
- object (BP/material/etc. if relevant)
- process (cutover/incident/interface/etc.)
- tags
- owner
- updated_at
- sensitivity (public/internal)
This lets you filter and rank like an engineer.
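Pinned down as a TS type (one of the two options named in Deliverable C), the schema might look like the sketch below; the union values are examples from this roadmap, not an exhaustive list:

```typescript
// chunk-metadata.ts — metadata schema as a TS type (Deliverable C option).
// Field names follow Step 4; extend the unions as your landscape requires.

export interface ChunkMetadata {
  doc_id: string;
  doc_title: string;
  section_title: string;
  system: "AS4" | "PS4" | "MDG" | "S4";
  object?: string;       // only when relevant, e.g. "BP", "material"
  process?: string;      // e.g. "cutover", "incident", "interface"
  tags: string[];
  owner: string;
  updated_at: string;    // ISO date, e.g. "2025-01-15"
  sensitivity: "public" | "internal";
}

export interface ChunkRecord {
  text: string;          // the chunk content (doc title + section title + body)
  metadata: ChunkMetadata;
}
```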
Step 5: Build ingestion pipeline (repeatable)
Pipeline steps:
- read docs from /knowledge/
- parse structure (headings)
- chunk content
- attach metadata
- output a JSONL file of chunks
Don’t jump to embeddings yet.
First produce clean chunks.
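A minimal end-to-end sketch, assuming the Step 2 frontmatter, the Step 3 `chunkByHeadings` sketch, Node's `fs`/`path`, and the `gray-matter` package for frontmatter parsing; the paths and file names are assumptions:

```typescript
// ingest.ts — ingestion sketch: knowledge/*.md -> chunks.jsonl

import { readdirSync, readFileSync, writeFileSync } from "node:fs";
import { join, basename } from "node:path";
import matter from "gray-matter";
import { chunkByHeadings } from "./chunker";
import type { ChunkRecord } from "./chunk-metadata";

const KNOWLEDGE_DIR = "knowledge";
const OUTPUT_FILE = "chunks.jsonl";

const records: ChunkRecord[] = [];

for (const file of readdirSync(KNOWLEDGE_DIR).filter((f) => f.endsWith(".md"))) {
  const raw = readFileSync(join(KNOWLEDGE_DIR, file), "utf8");
  const { data, content } = matter(raw); // frontmatter -> data, body -> content

  for (const chunk of chunkByHeadings(data.title ?? basename(file, ".md"), content)) {
    records.push({
      text: chunk.text,
      metadata: {
        doc_id: basename(file, ".md"),
        doc_title: chunk.docTitle,
        section_title: chunk.sectionTitle,
        system: data.system,
        object: data.object,
        process: data.process,
        tags: data.tags ?? [],
        owner: data.owner,
        updated_at: String(data.updated_at ?? ""),
        sensitivity: data.sensitivity ?? "internal",
      },
    });
  }
}

// One JSON object per line: easy to diff, easy to regenerate, easy to index later.
writeFileSync(OUTPUT_FILE, records.map((r) => JSON.stringify(r)).join("\n") + "\n");
console.log(`Wrote ${records.length} chunks to ${OUTPUT_FILE}`);
```

Re-running the script should regenerate chunks.jsonl from the knowledge folder with no manual steps; that is the "repeatable" part.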
Deliverables (you must ship these)
Deliverable A — Source list + rules
- a doc: what sources are allowed
- what is forbidden
Deliverable B — Doc templates
- a standard markdown template exists
- at least 3 docs created using it
Deliverable C — Chunking + metadata schema
- documented chunking strategy
- metadata schema defined (JSON schema or TS type)
Deliverable D — Ingestion output
- a generated chunks.jsonl file exists
- each record has text + metadata
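For reference, one line of the generated chunks.jsonl might look like this (all values are illustrative):

```json
{"text": "Reprocess failed BP replication :: Steps\n\n1. Check the outbound queue...", "metadata": {"doc_id": "bp-replication-runbook", "doc_title": "Reprocess failed BP replication", "section_title": "Steps", "system": "MDG", "object": "BP", "process": "interface", "tags": ["bp", "replication"], "owner": "mdm-team", "updated_at": "2025-01-15", "sensitivity": "internal"}}
```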
Common traps (don’t do this)
- Trap 1: “Just dump PDFs.”
Dumping PDFs is how you get unusable retrieval. Structure first.
- Trap 2: “No metadata.”
Without metadata, retrieval becomes random.
- Trap 3: “Too many sources.”
Start with high-trust sources only.
Quick self-check (2 minutes)
Answer yes/no:
- Do I have an approved sources list?
- Are docs consistent with a template?
- Does my chunking preserve procedure context?
- Does every chunk have useful metadata?
- Can I regenerate chunks.jsonl any time?
If any “no” — fix it before moving on.