Modern SAP AMS: outcome-driven operations with responsible agentic support
The interface backlog is blocking billing again. Someone proposes a “quick” data correction in production to unblock IDocs, while a change request for a small enhancement is waiting for a transport window. The incident gets closed on time, but the same pattern returns after the next release freeze. People remember fragments of the fix. The runbook is outdated. The real rule sits in a chat thread.
That’s normal L2–L4 AMS reality: complex incidents, change requests, problem management, process improvements, and small-to-medium developments competing for the same attention.
Why this matters now
Many SAP landscapes show “green” SLAs while the business still feels pain. Classic metrics (ticket closure time, SLA compliance) can hide:
- Repeat incidents: the same batch chain fails every month-end; reopened tickets look like “new work”.
- Manual touch work: triage, log collection, stakeholder updates, transport coordination.
- Knowledge loss: fixes live in heads; handovers reset maturity.
- Cost drift: more tickets and more escalations, even if each is closed “on time”.
Modern AMS is not about closing more tickets faster. It’s about reducing repeat demand, making change safer, and building learning loops so operations gets easier over time.
Agentic / AI-assisted ways of working can help with the manual parts (triage, evidence collection, drafting, documentation). But they also introduce a new risk: silent failure—confident answers without evidence. The source record behind this article is blunt: agents will fail; the question is whether they fail safely (Dzmitryi Kharlanau, “Failure Modes & Fallbacks”).
The mental model
Classic AMS optimizes for throughput:
- Intake → assign → fix → close
- Success = SLA met, backlog stable
Modern AMS optimizes for outcomes:
- Intake → restore service → remove root cause → prevent recurrence → codify knowledge
- Success = repeat rate down, change failure rate down, manual touch time down, run cost predictable
Two rules of thumb I use:
- If an incident repeats, it’s a problem until proven otherwise. Treat “reopen” and “same symptom” as signals, not noise.
- Never accept an automated answer for a critical decision without evidence. If logs/metrics are missing, the system must say so.
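To make the first rule operational, here is a minimal Python sketch, assuming incidents are available as simple records with a free-text symptom and a reopened flag (the field names and the normalization rule are illustrative, not taken from any specific ITSM tool): normalize each symptom into a fingerprint and flag any fingerprint that repeats or reopens as a problem candidate.

```python
from collections import defaultdict
from dataclasses import dataclass
import re


@dataclass
class Incident:
    # Illustrative fields; map them from your ITSM export.
    ticket_id: str
    symptom: str          # free-text short description
    reopened: bool = False


def fingerprint(symptom: str) -> str:
    """Normalize a symptom so 'IDoc 0815 stuck' and 'IDoc 4711 stuck' count as the same pattern."""
    text = re.sub(r"\d+", "<n>", symptom.lower())   # drop volatile numbers (IDoc ids, job counters)
    return re.sub(r"\s+", " ", text).strip()


def problem_candidates(incidents: list[Incident], min_signals: int = 2) -> dict[str, list[str]]:
    """Treat repeats and reopens as signals: any fingerprint with enough signals is a problem candidate."""
    signals: dict[str, list[str]] = defaultdict(list)
    for inc in incidents:
        key = fingerprint(inc.symptom)
        signals[key].append(inc.ticket_id)
        if inc.reopened:
            signals[key].append(f"{inc.ticket_id}:reopen")   # a reopen counts as an extra signal
    return {key: ids for key, ids in signals.items() if len(ids) >= min_signals}
```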
What changes in practice
- From incident closure → to root-cause removal
  Every high-impact incident ends with a short problem statement, likely causes, and a next preventive action (monitoring, code fix, master data rule, authorization adjustment). Closure includes “what stops this from coming back?”
- From tribal knowledge → to searchable, versioned knowledge
  Runbooks are living documents: “symptom → checks → safe actions → escalation.” Version them. Link them to incidents/problems. Retire outdated steps explicitly.
- From manual triage → to assisted triage with evidence gates
  Use automation to collect logs, interface statuses, batch outcomes, recent transports/imports, and known errors. But enforce: no diagnosis without retrieved evidence (a direct guard from the source).
- From reactive firefighting → to risk-based prevention
  Track the top recurring failure patterns: interfaces, batch chains, master data defects, authorizations. Put owners on the top 3 and measure whether repeats drop.
- From “one vendor” thinking → to clear decision rights
  Define who can decide: functional vs technical, security vs operations, business sign-off vs IT sign-off, especially for production data corrections and emergency changes.
- From “do the fix” → to rollback discipline
  Every change (even small) has a rollback plan: what to revert, how to validate, and who approves. This slows you down at first, but it prevents long outages later.
- From activity metrics → to learning metrics
  Add operational signals: reopen rate, repeat rate, MTTR trend, change failure rate, backlog aging, and manual touch time (even if estimated). A minimal calculation sketch follows this list.
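The learning metrics above can be computed from a plain ticket export. A minimal sketch, assuming each record carries open/close timestamps, a reopened flag, a symptom fingerprint, and (for changes) a failed flag; these field names are assumptions, not a product schema.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Ticket:
    # Illustrative fields; adapt to your ITSM export.
    opened_at: datetime
    closed_at: datetime
    reopened: bool
    fingerprint: str              # normalized symptom, as in the earlier sketch


@dataclass
class Change:
    failed: bool                  # caused an incident or had to be rolled back


def reopen_rate(tickets: list[Ticket]) -> float:
    return sum(t.reopened for t in tickets) / len(tickets)


def repeat_rate(tickets: list[Ticket]) -> float:
    """Share of tickets whose symptom fingerprint occurs more than once."""
    counts: dict[str, int] = {}
    for t in tickets:
        counts[t.fingerprint] = counts.get(t.fingerprint, 0) + 1
    return sum(1 for t in tickets if counts[t.fingerprint] > 1) / len(tickets)


def mttr_hours(tickets: list[Ticket]) -> float:
    return mean((t.closed_at - t.opened_at).total_seconds() / 3600 for t in tickets)


def change_failure_rate(changes: list[Change]) -> float:
    return sum(c.failed for c in changes) / len(changes)
```

Run these per month to get trend lines; backlog aging and manual touch time need a few extra fields but follow the same pattern.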
Agentic / AI pattern (without magic)
“Agentic” here means: a workflow where a system can plan steps, retrieve context, draft actions, and execute only pre-approved safe tasks under human control.
A realistic end-to-end workflow for L2–L4 (complex incident + follow-up change):
Inputs
- Incident description, priority, impacted process
- Monitoring signals (if available), logs, interface/IDoc status, batch chain outcomes
- Recent transports/imports and change calendar
- Runbooks and known-error knowledge base
Steps
- Classify and route: propose component (interface, batch, authorization, master data, custom code) and suggest L2/L3 owner.
- Retrieve context: pull relevant runbook sections and past similar incidents. If data is missing, it must say: verification not possible (matches the “missing data” failure mode).
- Propose actions: provide options with risk notes (e.g., reprocess interface vs restart job vs open problem record).
- Request approvals: for anything that changes production behavior (job restarts beyond runbook, config changes, data corrections, emergency transports).
- Execute safe tasks (only if pre-approved): collect evidence, draft stakeholder update, open problem ticket, prepare a change request template, generate a test checklist, or run read-only checks.
- Document: write the incident timeline, evidence used, decision points, and what was done. Store links to logs and approvals.
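A minimal sketch of that flow, assuming the retrieval, approval, and execution hooks are supplied by your own tooling; the function names, the safe-task list, and the incident fields below are illustrative, not part of the source.

```python
from dataclasses import dataclass, field

# Pre-approved safe tasks: read-only or drafting work only (illustrative names).
SAFE_TASKS = {
    "collect_evidence", "draft_stakeholder_update", "open_problem_ticket",
    "prepare_change_request_template", "generate_test_checklist", "run_read_only_checks",
}


@dataclass
class TriageResult:
    component: str = "unknown"                            # interface / batch / authorization / master data / custom code
    evidence: list[str] = field(default_factory=list)     # links to runbook sections, past incidents, logs
    proposals: list[str] = field(default_factory=list)    # options with risk notes, drafted for a human
    log: list[str] = field(default_factory=list)          # audit trail of what the agent did


def triage(incident: dict, retrieve, approve, execute) -> TriageResult:
    """retrieve/approve/execute are caller-supplied hooks; the agent never acts outside them."""
    result = TriageResult(component=incident.get("suspected_component", "unknown"))

    # Retrieve context first; refuse to diagnose without evidence.
    result.evidence = retrieve(incident)
    if not result.evidence:
        result.log.append("verification not possible: no evidence retrieved; asking for inputs")
        return result

    # Draft options with risk notes for a human to review (content generation not shown).
    result.proposals = [f"option drafted from {source}" for source in result.evidence[:3]]

    # Execute only pre-approved safe tasks; everything else needs a human decision.
    for action in incident.get("requested_actions", []):
        if action in SAFE_TASKS:
            execute(action)
            result.log.append(f"executed safe task: {action}")
        elif approve(action):
            result.log.append(f"human approved: {action} handed over for execution outside the agent")
        else:
            result.log.append(f"refused: {action} exceeds agent authority")

    return result
```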
Guardrails (where the source record is very relevant)
- Missing data → explicit stop: no guessing. Ask for inputs or offer a generic checklist.
- Tool failure (timeouts/partial results) → degraded mode: retry with backoff, then switch to read-only or escalate.
- Low confidence (conflicting sources) → human-in-the-loop: present options and uncertainties.
- Guardrail violation → refuse: if the request exceeds authority (e.g., “apply this fix in prod now”), the system must refuse and explain allowed alternatives.
- Budget exhaustion (latency/cost limits) → partial result: return what is known and what is not.
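One way to make those five fallbacks explicit is a small dispatch table, so the behavior is visible and testable rather than buried in prompts. A sketch, echoing the failure modes named in the source; the messages and logging format are assumptions.

```python
from enum import Enum, auto


class FailureMode(Enum):
    MISSING_DATA = auto()
    TOOL_FAILURE = auto()
    LOW_CONFIDENCE = auto()
    GUARDRAIL_VIOLATION = auto()
    BUDGET_EXHAUSTED = auto()


# Explicit, user-visible behavior per failure mode: no silent degradation.
FALLBACKS = {
    FailureMode.MISSING_DATA: "Stop. Say 'verification not possible', ask for the missing inputs "
                              "or offer a generic checklist.",
    FailureMode.TOOL_FAILURE: "Retry with backoff; if still failing, switch to read-only mode or escalate.",
    FailureMode.LOW_CONFIDENCE: "Hand over to a human with the competing options and their uncertainties.",
    FailureMode.GUARDRAIL_VIOLATION: "Refuse, explain the limit, and list allowed alternatives.",
    FailureMode.BUDGET_EXHAUSTED: "Return the partial result and state explicitly what was not checked.",
}


def fall_back(mode: FailureMode, audit_log: list[str]) -> str:
    """Every fallback is surfaced to the user and logged, so spikes can be measured later."""
    audit_log.append(f"fallback:{mode.name}")
    return FALLBACKS[mode]
```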
Also add operational controls:
- Least privilege: read-only by default; no broad authorizations.
- Separation of duties: the person approving a production change is not the same “actor” executing it.
- Audit trail: log what data was accessed, what was proposed, what was approved, and what was executed.
- Rollback: required for changes; tested fallback paths (another guard from the source).
- Privacy: avoid copying sensitive business data into prompts; redact where possible.
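These controls are easier to audit if every agent action passes through one narrow gate. A minimal sketch, assuming an in-memory audit log and an allow-list of read-only scopes; all names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Least privilege: the agent's default scopes are read-only (illustrative names).
READ_ONLY_SCOPES = {"read_logs", "read_idoc_status", "read_job_status", "read_transport_list"}


@dataclass(frozen=True)
class AuditEntry:
    timestamp: datetime
    actor: str          # agent or human identifier
    action: str
    approved_by: str    # empty for read-only actions
    outcome: str        # "allowed" or "refused"


def gated_call(actor: str, action: str, audit: list[AuditEntry], approved_by: str = "") -> bool:
    """Read-only scopes pass; anything else needs a named approver who is not the
    executing actor (separation of duties). Every decision lands in the audit trail."""
    allowed = action in READ_ONLY_SCOPES or (approved_by != "" and approved_by != actor)
    audit.append(AuditEntry(datetime.now(timezone.utc), actor, action, approved_by,
                            "allowed" if allowed else "refused"))
    return allowed
```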
What stays human-owned: production change approval, data corrections with audit impact, security decisions, and business sign-off on process changes. In practice, the agent can prepare, but not decide.
Implementation steps (first 30 days)
- Define outcomes and baseline
  Purpose: stop optimizing only for closure.
  How: capture current reopen rate, repeat patterns, MTTR trend, change failure rate, backlog aging.
  Success: baseline agreed and visible.
- Tighten intake quality for L2–L4
  Purpose: reduce back-and-forth.
  How: minimum fields for incidents/changes (impact, steps to reproduce, evidence links, business deadline).
  Success: fewer clarification loops; faster assignment.
- Create a “safe task” catalog
  Purpose: decide what an agent may do.
  How: list read-only checks, evidence collection, drafting updates, creating problem records. Explicitly exclude prod changes.
  Success: approved list exists; exceptions require approval.
- Write fallback rules (explicit)
  Purpose: prevent silent failure.
  How: implement the five failure modes from the source (missing data, tool failure, low confidence, guardrail violation, budget exhaustion) with clear user messages.
  Success: every fallback is visible and logged.
- Start with one workflow: triage + documentation
  Purpose: get value without risk.
  How: use assisted retrieval from runbooks/known errors; draft incident summary and next actions.
  Success: manual touch time down; better evidence trails.
- Establish approval gates and audit logging
  Purpose: safe operations.
  How: define who approves what (emergency change, data correction, config). Log approvals and execution steps.
  Success: no “mystery fixes”; audits are easier.
- Introduce a weekly problem review
  Purpose: turn repeats into prevention.
  How: top recurring incidents, assign owners, track preventive actions.
  Success: repeat rate starts trending down (even slowly).
- Measure fallback frequency
  Purpose: detect agent reliability issues.
  How: count fallback types; alert if the fallback rate spikes (from the source’s observability section). See the sketch after this list.
  Success: you can see when the system is “blind” (missing data) or “broken” (tool failure).
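A sketch of the fallback-frequency signal from the last step, assuming fallback events are logged as timestamped mode names; the window and threshold are placeholders to tune, not recommendations from the source.

```python
from collections import Counter
from datetime import datetime, timedelta


def fallback_report(events: list[tuple[datetime, str]],
                    window: timedelta = timedelta(days=1),
                    spike_threshold: int = 10) -> dict:
    """events: (timestamp, fallback mode name) pairs, e.g. a datetime plus 'MISSING_DATA'.
    Returns per-mode counts within the window and a simple spike flag for alerting."""
    if not events:
        return {"counts": {}, "spike": False}
    cutoff = max(ts for ts, _ in events) - window
    counts = Counter(mode for ts, mode in events if ts >= cutoff)
    return {
        "counts": dict(counts),                          # 'blind' (MISSING_DATA) vs 'broken' (TOOL_FAILURE)
        "spike": sum(counts.values()) >= spike_threshold,
    }
```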
Pitfalls and anti-patterns
- Automating a broken intake process and expecting better outcomes.
- Trusting summaries that don’t cite evidence (classic “missing data” failure).
- Giving broad access “to make it work”, then discovering audit gaps.
- No clear owner for problem management; everything stays as incidents.
- Treating low confidence as “try harder” instead of escalating with options.
- Over-customizing runbooks until nobody maintains them.
- Measuring only ticket counts and celebrating while repeat demand grows.
- Allowing production actions without a rollback plan.
- Ignoring change management: people bypass the process under pressure.
- Not testing fallback paths; failures become accidental and confusing.
Checklist
- Do we track repeat incidents and reopen rate, not just SLA closure?
- Are runbooks searchable, versioned, and linked to incidents/problems?
- Does assisted triage refuse to guess when data is missing?
- Are tool failures handled with retry/backoff and degraded read-only mode?
- Is low confidence routed to a human with clear options?
- Are guardrail violations refused with allowed alternatives?
- Do we log fallback type and frequency, and alert on spikes?
- Are approvals, audit trails, and rollback steps mandatory for prod changes?
FAQ
Is this safe in regulated environments?
It can be, if you enforce least privilege, separation of duties, audit logs, and explicit approvals. The unsafe version is silent automation without evidence and traceability.
How do we measure value beyond ticket counts?
Use repeat rate, reopen rate, MTTR trend, change failure rate, backlog aging, and manual touch time. These show whether operations is getting easier.
What data do we need for RAG / knowledge retrieval?
In general, you need clean runbooks, known-error records, incident timelines, and links to evidence (logs/monitoring outputs). If knowledge is not maintained, retrieval will amplify outdated advice.
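To illustrate the “no evidence, no answer” point, a deliberately naive retrieval sketch: keyword overlap stands in for whatever retriever you actually use, and the chunk fields are assumptions. The gate, not the scoring, is the point.

```python
def retrieve_runbook_chunks(query: str, chunks: list[dict], min_overlap: int = 2) -> list[dict]:
    """chunks look like {'text': ..., 'runbook': ..., 'version': ...}; returns matches or []."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(c["text"].lower().split())), c) for c in chunks]
    return [c for score, c in sorted(scored, key=lambda pair: -pair[0]) if score >= min_overlap]


def answer_with_evidence(query: str, chunks: list[dict]) -> str:
    hits = retrieve_runbook_chunks(query, chunks)
    if not hits:
        # The 'missing data' fallback: stop instead of guessing.
        return "Verification not possible: no matching runbook or known-error record. Please provide inputs."
    citations = ", ".join(f"{c['runbook']} v{c['version']}" for c in hits[:3])
    return f"Draft answer based on: {citations} (human review required)."
```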
How to start if the landscape is messy?
Start with one process slice (e.g., interfaces or batch chains). Build a minimal knowledge set and evidence collection steps. Expand only after you can measure fewer repeats.
Will this replace L3/L4 experts?
No. It reduces time spent on searching, copying, and drafting. Experts still own diagnosis, design decisions, and risk acceptance.
What’s the biggest risk?
False confidence: an agent answering “as if it knows” when tools fail or data is missing. The source record is clear: explicit fallbacks preserve trust.
Next action
Next week, pick your top recurring incident pattern (interfaces, batch chains, master data, or authorizations) and run a 60-minute internal review: write one versioned runbook page, define the “safe tasks” an assistant may perform for that pattern, and add one explicit fallback rule for missing data (“stop and ask for inputs”). This single slice will show quickly whether you are moving from closure to prevention.
Source reference: Dzmitryi Kharlanau (SAP Lead), “Failure Modes & Fallbacks: What Agents Do When Things Go Wrong”, https://dkharlanau.github.io
Agentic Design Blueprint — 2/19/2026
