Phase 4 · W29–W30

W29–W30: Knowledge Base Design (sources, chunking, metadata)

Design a trustworthy, structured knowledge base with chunking and metadata that supports source-grounded answers.

Suggested time: 4–6 hours/week

Outcomes

  • A list of approved sources (what’s allowed in the knowledge base).
  • A document structure standard (so content is consistent).
  • A chunking strategy that doesn’t ruin meaning.
  • A metadata schema that makes retrieval smart.
  • An ingestion pipeline that produces “chunks” you can index later.

Deliverables

  • Source rules doc with allowed and forbidden source types.
  • Standard markdown template with at least 3 docs using it.
  • Documented chunking strategy plus metadata schema.
  • Generated chunks.jsonl where each record has text + metadata.

Prerequisites

  • W27–W28: Operational Metrics & Reporting

W29–W30: Knowledge Base Design (sources, chunking, metadata)

What you’re doing

You stop having “docs scattered everywhere”.
You build a system that can answer questions with sources.

This is where RAG becomes real:

  • not magic chat
  • but a knowledge system with governance

Time: 4–6 hours/week
Output: a curated source set + chunking strategy + metadata schema + an ingestion pipeline that turns docs into searchable chunks


The promise (what you’ll have by the end)

By the end of W30 you will have:

  • A list of approved sources (what’s allowed in the knowledge base)
  • A document structure standard (so content is consistent)
  • A chunking strategy that doesn’t ruin meaning
  • A metadata schema that makes retrieval smart
  • An ingestion pipeline that produces “chunks” you can index later

The rule: garbage in = garbage answers

If you feed messy docs, you get messy answers.
So we treat the knowledge base like a data engineering problem:

  • clear sources
  • clear structure
  • clean metadata
  • repeatable ingestion

Step 1: Define your sources (what goes in)

Pick 3–6 source types max:

  • runbooks (how-to)
  • RCA summaries (why it happened)
  • mapping rules (old→new)
  • interface specs (what data moves where)
  • “known issues” list
  • cutover checklists (optional)

Do not include:

  • private personal data
  • random chat logs
  • unverified notes

If you can’t trust it, don’t ingest it.
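
If you want those rules to be more than a doc, one option is to encode them so the ingestion step can reject anything off-list. A minimal TypeScript sketch; the constant names, list values, and filename patterns are illustrative assumptions, not a prescribed format:

```typescript
// Source rules as data, so ingestion can validate inputs instead of trusting humans.
// Values mirror the allowed/forbidden lists above; adjust to your own source set.
export const ALLOWED_SOURCE_TYPES = [
  "runbook",
  "rca_summary",
  "mapping_rules",
  "interface_spec",
  "known_issues",
  "cutover_checklist",
] as const;

export type SourceType = (typeof ALLOWED_SOURCE_TYPES)[number];

// Filename patterns that should never be ingested.
export const FORBIDDEN_PATTERNS: RegExp[] = [
  /\bpii\b|personal[-_ ]data/i, // private personal data
  /chat[-_ ]?log/i,             // random chat logs
  /\bdraft\b|unverified/i,      // unverified notes
];

export function isIngestable(sourceType: string, fileName: string): boolean {
  return (
    (ALLOWED_SOURCE_TYPES as readonly string[]).includes(sourceType) &&
    !FORBIDDEN_PATTERNS.some((p) => p.test(fileName))
  );
}
```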


Step 2: Create a doc standard (simple template)

Every doc should have:

  • title
  • system/context (AS4/PS4/MDG/S4)
  • tags
  • owner
  • updated_at
  • “When to use” section
  • “Steps” section
  • “Common failures” section
  • references/links

Consistency beats beauty.
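
A minimal template sketch, assuming markdown with YAML-style frontmatter; every value in angle brackets is a placeholder, and the field names follow the list above:

```markdown
---
title: <short, action-oriented title>
system: MDG            # AS4 / PS4 / MDG / S4
tags: [<tag1>, <tag2>]
owner: <team or person>
updated_at: <YYYY-MM-DD>
---

## When to use
<one or two sentences describing the situation>

## Steps
1. <step>
2. <step>

## Common failures
- <symptom> → <fix>

## References
- <link to spec / ticket / related doc>
```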


Step 3: Design chunking (don’t destroy context)

Chunking goal:

  • small enough to retrieve
  • big enough to make sense

Simple rules:

  • chunk by headings/sections
  • keep tables/code blocks intact
  • include doc title + section title in every chunk
  • avoid splitting in the middle of a procedure

Start with 300–800 tokens per chunk (roughly).
Adjust later.
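
A minimal sketch of heading-based chunking in TypeScript, assuming markdown sources and a rough characters-to-tokens heuristic; the names (`chunkBySections`, `RawChunk`) are illustrative, not a required API:

```typescript
// Heading-based chunking: split on "## " sections, keep each section intact,
// and prefix every chunk with doc title + section title so it stands alone.
// Token estimate is a rough ~4 chars/token heuristic; swap in a real tokenizer later.

export interface RawChunk {
  docTitle: string;
  sectionTitle: string;
  text: string;
  tokenEstimate: number;
}

const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

export function chunkBySections(docTitle: string, markdown: string): RawChunk[] {
  const chunks: RawChunk[] = [];
  // Piece 0 is whatever comes before the first "## " heading (intro, frontmatter).
  const pieces = markdown.split(/^## /m);
  pieces.forEach((piece, i) => {
    let sectionTitle: string;
    let body: string;
    if (i === 0) {
      sectionTitle = "Introduction";
      body = piece.trim();
    } else {
      const nl = piece.indexOf("\n");
      sectionTitle = (nl === -1 ? piece : piece.slice(0, nl)).trim();
      body = nl === -1 ? "" : piece.slice(nl + 1).trim();
    }
    if (!body) return;
    const text = `${docTitle} · ${sectionTitle}\n\n${body}`;
    // Sections far outside the 300–800 token range can be split or merged later.
    chunks.push({ docTitle, sectionTitle, text, tokenEstimate: estimateTokens(text) });
  });
  return chunks;
}
```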


Step 4: Metadata schema (this is the secret sauce)

Metadata is what keeps retrieval from being dumb.

Each chunk should store:

  • doc_id
  • doc_title
  • section_title
  • system (MDG/S4/etc.)
  • object (BP/material/etc. if relevant)
  • process (cutover/incident/interface/etc.)
  • tags
  • owner
  • updated_at
  • sensitivity (public/internal)

This lets you filter and rank like an engineer.
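
A sketch of that schema as a TypeScript type (Deliverable C accepts "JSON schema or TS type"); the union values are examples pulled from this module's context, not an exhaustive list:

```typescript
// Metadata attached to every chunk. Field names mirror the list above;
// extend the unions for your own systems, objects, and processes.
export interface ChunkMetadata {
  doc_id: string;
  doc_title: string;
  section_title: string;
  system: "MDG" | "S4" | "AS4" | "PS4" | "other";
  object?: string;       // e.g. "BP", "material"; only when relevant
  process?: string;      // e.g. "cutover", "incident", "interface"
  tags: string[];
  owner: string;
  updated_at: string;    // ISO date, e.g. "2025-01-15"
  sensitivity: "public" | "internal";
}

export interface Chunk {
  text: string;          // doc title + section title + section body
  metadata: ChunkMetadata;
}
```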


Step 5: Build ingestion pipeline (repeatable)

Pipeline steps:

  1. read docs from /knowledge/
  2. parse structure (headings)
  3. chunk content
  4. attach metadata
  5. output a JSONL file of chunks

Don’t jump to embeddings yet.
First produce clean chunks.
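
A minimal end-to-end sketch in TypeScript (Node), reusing the chunking and metadata sketches above; the paths, module names, and placeholder metadata values are assumptions, and a real pipeline would read them from frontmatter instead of hard-coding them:

```typescript
import { promises as fs } from "fs";
import * as path from "path";
import { chunkBySections } from "./chunking";      // sketch from Step 3
import { Chunk, ChunkMetadata } from "./metadata"; // sketch from Step 4

// Read every markdown doc under knowledge/, chunk it, attach metadata,
// and write one JSON object per line to chunks.jsonl.
async function ingest(knowledgeDir: string, outFile: string): Promise<void> {
  const files = (await fs.readdir(knowledgeDir)).filter((f) => f.endsWith(".md"));
  const lines: string[] = [];

  for (const file of files) {
    const raw = await fs.readFile(path.join(knowledgeDir, file), "utf8");
    const docId = path.basename(file, ".md");
    const docTitle = docId.replace(/[-_]/g, " "); // placeholder; prefer the frontmatter title

    for (const c of chunkBySections(docTitle, raw)) {
      const metadata: ChunkMetadata = {
        doc_id: docId,
        doc_title: docTitle,
        section_title: c.sectionTitle,
        system: "other",                            // placeholder; read from frontmatter
        tags: [],
        owner: "unknown",
        updated_at: new Date().toISOString().slice(0, 10),
        sensitivity: "internal",
      };
      const chunk: Chunk = { text: c.text, metadata };
      lines.push(JSON.stringify(chunk));
    }
  }

  await fs.writeFile(outFile, lines.join("\n") + "\n", "utf8");
}

ingest("knowledge", "chunks.jsonl").catch(console.error);
```

Each output line is one chunk with text + metadata, which is what Deliverable D asks for, and re-running the script regenerates chunks.jsonl from scratch.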


Deliverables (you must ship these)

Deliverable A — Source list + rules

  • a doc: what sources are allowed
  • what is forbidden

Deliverable B — Doc templates

  • a standard markdown template exists
  • at least 3 docs created using it

Deliverable C — Chunking + metadata schema

  • documented chunking strategy
  • metadata schema defined (JSON schema or TS type)

Deliverable D — Ingestion output

  • a generated chunks.jsonl file exists
  • each record has text + metadata

Common traps (don’t do this)

  • Trap 1: “Just dump PDFs.”
    Dumping PDFs is how you get unusable retrieval. Structure first.

  • Trap 2: “No metadata.”
    Without metadata, retrieval becomes random.

  • Trap 3: “Too many sources.”
    Start with high-trust sources only.

Quick self-check (2 minutes)

Answer yes/no:

  • Do I have an approved sources list?
  • Are docs consistent with a template?
  • Does my chunking preserve procedure context?
  • Does every chunk have useful metadata?
  • Can I regenerate chunks.jsonl any time?

If any “no” — fix it before moving on.


Next module: W31–W32: Retrieval Quality (search, ranking, relevance)