Modern SAP AMS: outcomes first, and careful use of agentic ways of working
The ticket says “Interface backlog blocking billing.” L2 is chasing logs, L3 is checking mappings, and L4 is already drafting a “quick fix” change request because month-end is close. Everyone is busy, SLAs will probably be green, and yet the same pattern will return after the next release freeze—because the real cause is hidden in a mix of undocumented rules, fragile batch chains, and a handover note that never made it into a runbook.
That is the daily reality across L2–L4: complex incidents, change requests, problem management, process improvements, and small-to-medium developments. If AMS only optimizes for ticket closure, it will look healthy while cost and risk drift quietly.
Why this matters now
Green SLAs can hide three expensive problems:
- Repeat incidents: the same IDoc failures, batch delays, or authorization defects come back because fixes are local, not systemic.
- Manual work that grows: triage, evidence gathering, and “who knows this?” escalations consume senior time. MTTR may improve short-term, but run cost rises.
- Knowledge loss: the actual resolution steps sit in people’s heads or in ticket comments that are not searchable, not versioned, and not reviewed.
Modern SAP AMS is not a different contract. It is different day-to-day behavior: fewer repeats, safer changes, and a learning loop that turns every major issue into better monitoring, better runbooks, and clearer ownership.
Agentic / AI-assisted work can help here—but only if it is treated like engineering with tests and guardrails, not like a chat window that “sounds right.”
The mental model
Classic AMS optimizes for throughput: close tickets, meet response times, keep queues moving.
Modern AMS optimizes for outcomes:
- reduce repeat rate (problems removed, not just patched)
- reduce change failure rate (safer transports/imports, clearer rollback)
- reduce manual touch time (less copy/paste triage)
- keep run cost predictable (fewer surprises, fewer escalations)
Two rules of thumb:
- If an incident repeats, it is a problem until proven otherwise. Treat “same symptom” as a signal for root-cause work, even if the workaround is known.
- If you cannot measure agent quality, you cannot improve it. This comes straight from the source: agents regress silently after prompt or knowledge changes, and human impressions are inconsistent.
What changes in practice
- From incident closure → root-cause removal
  L2 closes the ticket, but L3/L4 own the “why.” A recurring interface failure should end with updated mapping checks, monitoring thresholds, and a problem record with a clear “done” definition.
- From tribal knowledge → searchable, versioned knowledge
  Runbooks and known errors must be treated like code: versioned, reviewed, and linked to incidents/changes. Ticket comments are not a knowledge base.
- From manual triage → assisted triage with evidence
  Use AI to draft hypotheses and ask for missing data, but require it to attach evidence (logs, monitoring signals, runbook references). “Sounds plausible” is not acceptable for production decisions (see the small sketch after this list).
- From reactive firefighting → risk-based prevention
  High-risk areas (batch chains, master data replication, authorizations, critical interfaces) get prevention ownership: health checks, alert tuning, and regression tests after changes.
- From “one vendor” thinking → explicit decision rights
  Define who can approve what: L2 can execute runbook steps, L3 proposes code/config changes, L4 signs off on technical design, and the business owns process acceptance. Separation of duties is not bureaucracy; it is control.
- From “done = transported” → “done = observable”
  A change is done when monitoring, rollback steps, and post-change verification are documented and executed. Otherwise you are just moving risk forward.
- From ad-hoc improvements → learning loops
  Every major incident produces at least one prevention artifact: a runbook update, a monitoring rule, or a regression case.
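A minimal sketch of the evidence rule behind assisted triage, assuming a simple draft structure rather than any specific tool’s schema:

```python
# Minimal sketch of the "evidence or it does not count" rule for assisted triage.
# The field names (hypotheses, evidence) are assumptions, not a specific tool's schema.
def accept_triage_draft(draft: dict) -> bool:
    """Reject an AI triage draft unless every hypothesis carries evidence references."""
    hypotheses = draft.get("hypotheses", [])
    if not hypotheses:
        return False
    for hypothesis in hypotheses:
        # Evidence = log excerpts, monitoring signals, or runbook references.
        if not hypothesis.get("evidence"):
            return False        # "sounds plausible" without evidence is not usable
    return True
```

The check is deliberately strict: a confident draft with empty evidence lists never reaches a human queue as a recommendation.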
Agentic / AI pattern (without magic)
“Agentic” here means: a workflow where a system can plan steps, retrieve context, draft actions, and execute only pre-approved safe tasks under human control.
A realistic end-to-end workflow for L2–L4 incident + change handling:
Inputs
- Incident ticket text, priority, affected process
- Logs and monitoring summaries (generalization: whatever your landscape provides)
- Related changes/transports history (metadata, not code by default)
- Runbooks, known errors, interface specs, authorization concepts
Steps
- Classify: incident vs request vs problem candidate; identify affected component (interface, batch, master data, auth).
- Retrieve context: pull the relevant runbook sections and the last similar incidents (grounding).
- Propose action: draft a triage plan covering what to check, what evidence to collect, and what is safe to try.
- Request approval: if the plan includes risky actions (data correction, production config, transport/import), the agent must stop and request the right approval.
- Execute safe tasks: only pre-approved actions like creating a draft incident update, generating a checklist, or preparing a change description. Execution in production should be tightly scoped.
- Document: write back what was done, what evidence supports the conclusion, and what prevention item is created.
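A minimal sketch of how these steps could hang together, assuming hypothetical interfaces for classification, retrieval, approvals, and auditing; none of the names below come from a real SAP or agent product API:

```python
# Sketch of the incident workflow above. The knowledge, approvals, and audit objects
# are hypothetical interfaces you would implement against your own landscape.
from dataclasses import dataclass, field

# Pre-approved safe tasks; everything else requires a human approval first.
SAFE_ACTIONS = {"draft_incident_update", "generate_checklist", "prepare_change_description"}

@dataclass
class TriagePlan:
    component: str                                      # interface, batch, master data, auth
    checks: list[str]                                   # what to verify, what evidence to collect
    proposed_actions: list[str]                         # safe and risky candidates
    evidence: list[str] = field(default_factory=list)   # log excerpts, runbook section IDs

def handle_incident(ticket: dict, knowledge, approvals, audit) -> TriagePlan:
    # 1. Classify: incident vs request vs problem candidate, affected component.
    component = knowledge.classify(ticket["text"])

    # 2. Retrieve context: relevant runbook sections and similar past incidents (grounding).
    context = knowledge.retrieve(ticket["text"], component)

    # 3. Propose action: a triage plan that must carry evidence references.
    plan = TriagePlan(
        component=component,
        checks=[item["check"] for item in context],
        proposed_actions=["draft_incident_update", "correct_master_data"],
        evidence=[item["source_id"] for item in context],
    )

    # 4. Request approval: risky actions (data corrections, production config, transports)
    #    stop the workflow until a human with the right decision rights approves.
    risky = [a for a in plan.proposed_actions if a not in SAFE_ACTIONS]
    if risky and not approvals.request(ticket["id"], risky):
        audit.log(ticket["id"], plan, status="waiting_for_approval")
        return plan

    # 5. Execute safe tasks only; anything touching production stays tightly scoped.
    for action in plan.proposed_actions:
        if action in SAFE_ACTIONS:
            audit.log(ticket["id"], plan, status=f"executed:{action}")

    # 6. Document: what was done, the supporting evidence, and the prevention item created.
    audit.log(ticket["id"], plan, status="documented")
    return plan
```

The allowlist is the point: adding a new production-touching action becomes an explicit decision with an approval path, not a prompt tweak.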
Guardrails
- Least privilege: the agent should not have broad production access. Start read-only where possible.
- Approvals and separation of duties: humans approve production changes, data corrections, and security decisions.
- Audit trail: store prompts, retrieved sources, outputs, and approvals. The source stresses that eval results must be stored and compared over time.
- Rollback discipline: every proposed change includes rollback steps and verification checks.
- Privacy: avoid sending sensitive business data into prompts; redact where needed (generalization, because the source does not specify your regulatory scope).
What stays human-owned: production change approval, business sign-off, authorization design decisions, and any action that can materially affect financial postings or master data integrity.
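The audit-trail and privacy guardrails can start as an append-only log with redaction applied before anything is stored or sent; a minimal sketch, with illustrative field names and redaction patterns that will not match every regulatory scope:

```python
# Sketch of an append-only audit trail with redaction. Field names and redaction
# patterns are illustrative assumptions, not a compliance recommendation.
import json
import re
from datetime import datetime, timezone
from pathlib import Path

AUDIT_FILE = Path("agent_audit.jsonl")      # assumed local store; replace with your own

SENSITIVE_PATTERNS = [
    re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}", re.I),   # e-mail addresses
    re.compile(r"\b\d{6,}\b"),                                    # long IDs / account numbers
]

def redact(text: str) -> str:
    for pattern in SENSITIVE_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

def audit_event(ticket_id: str, prompt: str, sources: list[str],
                output: str, approval: str | None = None) -> None:
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "ticket": ticket_id,
        "prompt": redact(prompt),        # privacy: redact before storing or sending anywhere
        "sources": sources,              # retrieved runbook / known-error IDs (grounding)
        "output": redact(output),
        "approval": approval,            # approver, or None for read-only steps
    }
    with AUDIT_FILE.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```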
Honestly, this will slow you down at first because you are adding checks you used to skip under pressure.
Implementation steps (first 30 days)
- Define outcomes for AMS (not just SLAs)
  How: agree on 3–5 metrics: repeat rate, reopen rate, backlog aging, MTTR trend, change failure rate (a minimal metrics sketch follows this list).
  Success signal: weekly review uses these metrics, not only closure counts.
- Map L2–L4 decision rights
  How: write a one-page “who approves what” for incidents, changes, data fixes, and authorizations.
  Success signal: fewer “who can sign this?” delays during critical incidents.
- Create a small “golden set” for your agent use cases (from the source)
  How: curate representative tasks: easy/medium/hard, edge cases, ambiguous inputs, and previously broken cases (regressions). A golden-set and gate sketch follows this list.
  Success signal: the set is used repeatedly, not once.
- Define eval dimensions and acceptance gates (from the source)
  How: score agent outputs on correctness, grounding, safety, usefulness, and cost & latency.
  Success signal: failing evals block release of agent changes (an explicit guard in the source).
- Start with shadow mode (from the source’s “shadow eval”)
  How: run the agent in parallel on real tickets, but do not let it change outcomes. Compare its recommendations with human actions.
  Success signal: measurable reduction in manual triage time without increased errors.
- Build a knowledge lifecycle
  How: pick one area (e.g., interface failures) and enforce the rule that every resolved major incident updates a runbook section. Version it.
  Success signal: fewer escalations that start with “do we have a doc?”
- Add a prevention deliverable to problem management
  How: every problem record must produce a monitoring or regression artifact.
  Success signal: repeat rate trends down for the targeted category.
- Set privacy and logging rules for prompts
  How: redact sensitive fields, define retention, and store audit logs.
  Success signal: security review has clear evidence of controls.
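For the first step, the outcome metrics can start as a small script over a flat ticket export; the field names below (category, symptom_hash, reopened, type, change_failed) are assumptions about that export, not a standard schema:

```python
# Sketch of the outcome metrics over a flat ticket export with assumed field names.
from collections import Counter

def outcome_metrics(tickets: list[dict]) -> dict:
    if not tickets:
        return {"repeat_rate": 0.0, "reopen_rate": 0.0, "change_failure_rate": 0.0}

    # Repeats: tickets sharing category + symptom fingerprint beyond the first occurrence.
    by_symptom = Counter((t["category"], t["symptom_hash"]) for t in tickets)
    repeats = sum(count - 1 for count in by_symptom.values() if count > 1)

    changes = [t for t in tickets if t.get("type") == "change"]
    failed_changes = sum(bool(t.get("change_failed")) for t in changes)

    return {
        "repeat_rate": repeats / len(tickets),
        "reopen_rate": sum(bool(t.get("reopened")) for t in tickets) / len(tickets),
        "change_failure_rate": failed_changes / len(changes) if changes else 0.0,
    }
```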
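For the golden set and acceptance gates, one case and one gate can look like this; the case content, runbook ID, and thresholds are illustrative assumptions, not values taken from the source:

```python
# Sketch of one golden-set case and a release gate over the eval dimensions named above.
# All concrete values (case text, runbook ID, thresholds) are illustrative.
GOLDEN_CASE = {
    "id": "INT-017",
    "difficulty": "medium",
    "input": "IDoc DEBMAS stuck in status 51 after a customer master change",
    "expected": "check partner profile and mapping, reference runbook RB-INT-012",
    "failure_signals": ["proposes a production data correction without approval"],
}

THRESHOLDS = {
    "correctness": 0.8,
    "grounding": 0.9,
    "safety": 1.0,        # no tolerance for unsafe proposals
    "usefulness": 0.7,
    "cost_latency": 0.7,
}

def release_gate(eval_results: list[dict]) -> bool:
    """Block a prompt/knowledge change if any eval dimension misses its threshold."""
    if not eval_results:
        return False
    for dimension, minimum in THRESHOLDS.items():
        average = sum(r[dimension] for r in eval_results) / len(eval_results)
        if average < minimum:
            print(f"BLOCKED: {dimension} averaged {average:.2f}, threshold {minimum}")
            return False
    return True
```

Run the gate on every prompt or knowledge change; a red gate means the change does not ship, even under month-end pressure.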
A limitation: if your logs and runbooks are incomplete or inconsistent, the agent will confidently produce weak answers unless grounding is enforced.
Pitfalls and anti-patterns
- Automating a broken triage process (you just get faster confusion).
- Trusting summaries without evidence links (no grounding).
- Giving the agent broad production access “temporarily.”
- No owner for knowledge updates; runbooks rot.
- Measuring only ticket volume; prevention work looks “unproductive.”
- Changing agent prompts/knowledge without regression evals (silent regressions are called out in the source).
- Ignoring eval failures under time pressure (also a source failure mode).
- Over-focusing on happy paths; real AMS is edge cases.
- Blurring responsibilities: the agent “decides” instead of proposing.
- Over-customizing workflows until nobody can maintain them.
Checklist
- Do we track repeat incidents and reopen rate by category (interfaces, batch, master data, auth)?
- Are decision rights written down for prod changes, data fixes, and security?
- Do runbooks have versions and owners?
- Is there a golden set of real AMS cases for agent evaluation?
- Do evals cover correctness, grounding, safety, usefulness, cost & latency?
- Do failing evals block changes to the agent?
- Is the agent read-only by default, with explicit approval gates?
- Do we store an audit trail of agent inputs, retrieved sources, and outputs?
- Does every major incident create one prevention artifact?
FAQ
Is this safe in regulated environments?
It can be, if you treat the agent like any other tool: least privilege, separation of duties, audit trails, and strict data handling. The risky part is not the model; it is uncontrolled access and undocumented actions.
How do we measure value beyond ticket counts?
Use outcome metrics: repeat rate, reopen rate, MTTR trend, backlog aging, change failure rate, and manual touch time for triage. Ticket closure alone rewards short-term fixes.
What data do we need for RAG / knowledge retrieval?
Start with what you already have: runbooks, known errors, interface specs, and sanitized ticket histories. The key is quality and versioning, not volume. (Generalization: the source focuses on evals, not data architecture.)
How to start if the landscape is messy?
Pick one painful slice (e.g., MDG replication troubleshooting, the micro example used in the source) and build a golden set around it. Shadow-evaluate first, then expand.
How do we stop the agent from making things up?
Require grounding: every recommendation must cite retrieved knowledge or tool outputs. Evaluate grounding explicitly, as the source suggests.
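A minimal sketch of such a grounding check, assuming each recommendation carries the IDs of the sources it cites:

```python
# Sketch of an explicit grounding score; field names (citations) are assumptions.
def grounding_score(recommendations: list[dict], retrieved_ids: set[str]) -> float:
    """Share of recommendations whose citations all resolve to actually retrieved sources."""
    if not recommendations:
        return 0.0
    grounded = sum(
        1 for rec in recommendations
        if rec.get("citations") and all(c in retrieved_ids for c in rec["citations"])
    )
    return grounded / len(recommendations)
```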
Who owns the agent in AMS?
Treat it like a shared operational asset: AMS lead owns process fit and metrics; solution architects own technical guardrails; security owns access and audit requirements.
Next action
Next week, take the last 20 high-effort L2–L4 tickets (interfaces, batch, master data, authorizations), label which ones repeated, and turn 5 of them into a first golden set with expected outcomes and failure signals—then run a shadow eval before you allow any agent output to influence production decisions.
Agentic Design Blueprint — 2/21/2026
