Phase 4 · W29–W30
W29–W30: Knowledge Base Design (sources, chunking, metadata)
Design a trustworthy, structured knowledge base with chunking and metadata that supports source-grounded answers.
Suggested time: 4–6 hours/week
Outcomes
- A list of approved sources (what’s allowed in the knowledge base).
- A document structure standard (so content is consistent).
- A chunking strategy that doesn’t ruin meaning.
- A metadata schema that makes retrieval smart.
- An ingestion pipeline that produces “chunks” you can index later.
Deliverables
- Source rules doc with allowed and forbidden source types.
- Standard markdown template with at least 3 docs using it.
- Documented chunking strategy plus metadata schema.
- Generated chunks.jsonl where each record has text + metadata.
Prerequisites
- W27–W28: Operational Metrics & Reporting
W29–W30: Knowledge Base Design (sources, chunking, metadata)
What you’re doing
You stop having “docs scattered everywhere”.
You build a system that can answer questions with sources.
This is where RAG becomes real:
- not magic chat
- but a knowledge system with governance
Time: 4–6 hours/week
Output: a curated source set + chunking strategy + metadata schema + an ingestion pipeline that turns docs into searchable chunks
The promise (what you’ll have by the end)
By the end of W30 you will have:
- A list of approved sources (what’s allowed in the knowledge base)
- A document structure standard (so content is consistent)
- A chunking strategy that doesn’t ruin meaning
- A metadata schema that makes retrieval smart
- An ingestion pipeline that produces “chunks” you can index later
The rule: garbage in = garbage answers
If you feed messy docs, you get messy answers.
So we treat the knowledge base like a data engineering problem:
- clear sources
- clear structure
- clean metadata
- repeatable ingestion
Step 1: Define your sources (what goes in)
Pick 3–6 source types max:
- runbooks (how-to)
- RCA summaries (why it happened)
- mapping rules (old→new)
- interface specs (what data moves where)
- “known issues” list
- cutover checklists (optional)
Do not include:
- private personal data
- random chat logs
- unverified notes
If you can’t trust it, don’t ingest it.
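To make these rules enforceable rather than aspirational, the allowlist can live in a small config that the ingestion step checks. A minimal TypeScript sketch, assuming the source types above; the names (`SourceRule`, `isApprovedSource`) are illustrative, not a standard:

```typescript
// source-rules.ts — a sketch of machine-checkable source rules.
// Type names and the specific source types are illustrative assumptions.

export type SourceType =
  | "runbook"
  | "rca_summary"
  | "mapping_rules"
  | "interface_spec"
  | "known_issues"
  | "cutover_checklist";

export interface SourceRule {
  type: SourceType;
  description: string;
  requiresOwner: boolean; // every approved source needs a named owner
}

export const APPROVED_SOURCES: SourceRule[] = [
  { type: "runbook", description: "How-to procedures", requiresOwner: true },
  { type: "rca_summary", description: "Why an incident happened", requiresOwner: true },
  { type: "mapping_rules", description: "Old-to-new field and value mappings", requiresOwner: true },
  { type: "interface_spec", description: "What data moves where", requiresOwner: true },
  { type: "known_issues", description: "Curated known-issues list", requiresOwner: true },
];

// Anything not on the allowlist (chat logs, unverified notes, personal data)
// gets rejected at ingestion time instead of quietly slipping in.
export function isApprovedSource(type: string): type is SourceType {
  return APPROVED_SOURCES.some((rule) => rule.type === type);
}
```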
Step 2: Create a doc standard (simple template)
Every doc should have:
- title
- system/context (AS4/PS4/MDG/S4)
- tags
- owner
- updated_at
- “When to use” section
- “Steps” section
- “Common failures” section
- references/links
Consistency beats beauty.
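If you want the standard to be checkable and not just documented, the same template can be written down as a type plus a required-sections list. A minimal sketch, assuming each markdown doc carries YAML frontmatter with the fields above; `missingSections` is a hypothetical lint helper, not an existing tool:

```typescript
// doc-standard.ts — the Step 2 template expressed as a checkable structure.
// Field names mirror the bullets above; the system values are examples.

export interface DocFrontmatter {
  title: string;
  system: "AS4" | "PS4" | "MDG" | "S4";
  tags: string[];
  owner: string;
  updated_at: string; // ISO date, e.g. "2025-01-15"
  sensitivity?: "public" | "internal";
}

// Every doc must contain these markdown sections, in this order.
export const REQUIRED_SECTIONS = ["When to use", "Steps", "Common failures", "References"] as const;

// Tiny lint helper: which required "## " headings are missing from a doc body?
export function missingSections(markdown: string): string[] {
  return REQUIRED_SECTIONS.filter((section) => !markdown.includes(`## ${section}`));
}
```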
Step 3: Design chunking (don’t destroy context)
Chunking goal:
- small enough to retrieve
- big enough to make sense
Simple rules:
- chunk by headings/sections
- keep tables/code blocks intact
- include doc title + section title in every chunk
- avoid splitting in the middle of a procedure
Start with 300–800 tokens per chunk (roughly).
Adjust later.
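A minimal sketch of heading-based chunking, assuming markdown input split on `## ` headings and a rough words-to-tokens estimate; the `Chunk` shape and the 0.75 heuristic are assumptions, not a fixed recipe:

```typescript
// chunker.ts — heading-based chunking sketch. Parsing is naive on purpose:
// it splits on "## " headings and keeps each section whole unless it is far too large.
// Note: it does not yet special-case fenced code blocks or tables; add that before
// trusting it on real runbooks.

export interface Chunk {
  text: string;
  docTitle: string;
  sectionTitle: string;
}

// Very rough token estimate (~0.75 words per token for English prose).
const approxTokens = (text: string): number =>
  Math.ceil(text.split(/\s+/).filter(Boolean).length / 0.75);

export function chunkByHeadings(docTitle: string, markdown: string, maxTokens = 800): Chunk[] {
  const chunks: Chunk[] = [];
  // Split into "## Section" blocks; the first part is any preamble before the first heading.
  const parts = markdown.split(/^## /m);

  for (const part of parts) {
    if (!part.trim()) continue;
    const [firstLine, ...rest] = part.split("\n");
    const sectionTitle = firstLine.replace(/^#+\s*/, "").trim() || "Introduction";
    const body = rest.join("\n").trim();
    if (!body) continue;

    // Prefix every chunk with doc + section title so it still makes sense in isolation.
    const prefix = `${docTitle} :: ${sectionTitle}\n\n`;

    if (approxTokens(body) <= maxTokens) {
      chunks.push({ text: prefix + body, docTitle, sectionTitle });
    } else {
      // Oversized section: split on blank lines (paragraph boundaries), never mid-paragraph.
      let buffer = "";
      for (const para of body.split(/\n\s*\n/)) {
        if (buffer && approxTokens(buffer + para) > maxTokens) {
          chunks.push({ text: prefix + buffer.trim(), docTitle, sectionTitle });
          buffer = "";
        }
        buffer += para + "\n\n";
      }
      if (buffer.trim()) {
        chunks.push({ text: prefix + buffer.trim(), docTitle, sectionTitle });
      }
    }
  }
  return chunks;
}
```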
Step 4: Metadata schema (this is the secret sauce)
Metadata is what keeps retrieval from being dumb.
Each chunk should store:
- doc_id
- doc_title
- section_title
- system (MDG/S4/etc.)
- object (BP/material/etc. if relevant)
- process (cutover/incident/interface/etc.)
- tags
- owner
- updated_at
- sensitivity (public/internal)
This lets you filter and rank like an engineer.
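Pinned down as a TS type (one of the two options named in Deliverable C), the schema might look like the sketch below; the union values are examples from this roadmap, not an exhaustive list:

```typescript
// chunk-metadata.ts — metadata schema as a TS type (Deliverable C option).
// Field names follow Step 4; extend the unions as your landscape requires.

export interface ChunkMetadata {
  doc_id: string;
  doc_title: string;
  section_title: string;
  system: "AS4" | "PS4" | "MDG" | "S4";
  object?: string;       // only when relevant, e.g. "BP", "material"
  process?: string;      // e.g. "cutover", "incident", "interface"
  tags: string[];
  owner: string;
  updated_at: string;    // ISO date, e.g. "2025-01-15"
  sensitivity: "public" | "internal";
}

export interface ChunkRecord {
  text: string;          // the chunk content (doc title + section title + body)
  metadata: ChunkMetadata;
}
```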
Step 5: Build ingestion pipeline (repeatable)
Pipeline steps:
- read docs from /knowledge/
- parse structure (headings)
- chunk content
- attach metadata
- output a JSONL file of chunks
Don’t jump to embeddings yet.
First produce clean chunks.
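A minimal end-to-end sketch, assuming the Step 2 frontmatter, the Step 3 `chunkByHeadings` sketch, Node's `fs`/`path`, and the `gray-matter` package for frontmatter parsing; the paths and file names are assumptions:

```typescript
// ingest.ts — ingestion sketch: knowledge/*.md -> chunks.jsonl

import { readdirSync, readFileSync, writeFileSync } from "node:fs";
import { join, basename } from "node:path";
import matter from "gray-matter";
import { chunkByHeadings } from "./chunker";
import type { ChunkRecord } from "./chunk-metadata";

const KNOWLEDGE_DIR = "knowledge";
const OUTPUT_FILE = "chunks.jsonl";

const records: ChunkRecord[] = [];

for (const file of readdirSync(KNOWLEDGE_DIR).filter((f) => f.endsWith(".md"))) {
  const raw = readFileSync(join(KNOWLEDGE_DIR, file), "utf8");
  const { data, content } = matter(raw); // frontmatter -> data, body -> content

  for (const chunk of chunkByHeadings(data.title ?? basename(file, ".md"), content)) {
    records.push({
      text: chunk.text,
      metadata: {
        doc_id: basename(file, ".md"),
        doc_title: chunk.docTitle,
        section_title: chunk.sectionTitle,
        system: data.system,
        object: data.object,
        process: data.process,
        tags: data.tags ?? [],
        owner: data.owner,
        updated_at: String(data.updated_at ?? ""),
        sensitivity: data.sensitivity ?? "internal",
      },
    });
  }
}

// One JSON object per line: easy to diff, easy to regenerate, easy to index later.
writeFileSync(OUTPUT_FILE, records.map((r) => JSON.stringify(r)).join("\n") + "\n");
console.log(`Wrote ${records.length} chunks to ${OUTPUT_FILE}`);
```

Re-running the script should regenerate chunks.jsonl from the knowledge folder with no manual steps; that is the "repeatable" part.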
Deliverables (you must ship these)
Deliverable A — Source list + rules
- a doc: what sources are allowed
- what is forbidden
Deliverable B — Doc templates
- a standard markdown template exists
- at least 3 docs created using it
Deliverable C — Chunking + metadata schema
- documented chunking strategy
- metadata schema defined (JSON schema or TS type)
Deliverable D — Ingestion output
- a generated chunks.jsonl file exists
- each record has text + metadata
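For reference, one line of the generated chunks.jsonl might look like this (all values are illustrative):

```json
{"text": "Reprocess failed BP replication :: Steps\n\n1. Check the outbound queue...", "metadata": {"doc_id": "bp-replication-runbook", "doc_title": "Reprocess failed BP replication", "section_title": "Steps", "system": "MDG", "object": "BP", "process": "interface", "tags": ["bp", "replication"], "owner": "mdm-team", "updated_at": "2025-01-15", "sensitivity": "internal"}}
```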
Common traps (don’t do this)
- Trap 1: “Just dump PDFs.”
Dumping PDFs is how you get unusable retrieval. Structure first.
- Trap 2: “No metadata.”
Without metadata, retrieval becomes random.
- Trap 3: “Too many sources.”
Start with high-trust sources only.
Quick self-check (2 minutes)
Answer yes/no:
- Do I have an approved sources list?
- Are docs consistent with a template?
- Does my chunking preserve procedure context?
- Does every chunk have useful metadata?
- Can I regenerate chunks.jsonl any time?
If any “no” — fix it before moving on.