This page is part of the ForgeSDLC knowledge base — an AI-assisted, human-directed methodology for taking product work from concept to production. For the core operating model and vocabulary, see Forge SDLC overview and What is ForgeSDLC?.
Incident management & response (blueprint)
Incident management is the disciplined process for detecting, responding to, and learning from production-impacting events. Organizational readiness (clear severity definitions, defined roles, communication plans, and follow-through) matters as much as tooling. This guide complements observability and SLO practices in sre-observability.md and the cultural framing in DEVOPS.md.
1. Overview: lifecycle and readiness
Effective response depends on preparedness (runbooks, dashboards, escalation paths) and learning (blameless postmortems, tracked actions). Incidents are normal at scale; the system is how you shorten impact and prevent recurrence.
| Communication | Purpose | Contents |
| --- | --- | --- |
| Resolution notice | Confirm restoration and close the loop | Cause summary (preliminary OK), customer impact duration, follow-up |
| Postmortem invite | Schedule learning | Link to doc, attendees, no-blame framing |
6. Postmortem / retrospective structure
| Section | Intent |
| --- | --- |
| Summary | What broke, for whom, and for how long |
| Blameless timeline | Factual sequence; no individual blame |
| Contributing factors | Multiple factors (not a single “root cause”): people, process, tech, external |
| What went well / poorly | Honest assessment of the response |
| Action items | Owner, due date, tracking link (ticket) |
| Follow-through | Review completion in operational forums |
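Action-item tracking is what separates a postmortem from a status meeting. A minimal sketch of a tracker record, assuming a simple in-house model; `ActionItem`, its field names, and the helpers are illustrative, not a ForgeSDLC or ticketing-tool API:

```python
from dataclasses import dataclass
from datetime import date

# Illustrative action-item record: every item carries an owner,
# a due date, and a tracking link so follow-through is auditable.
@dataclass
class ActionItem:
    description: str
    owner: str           # a named individual, not a team alias
    due: date
    ticket_url: str      # link into the team's tracker
    done: bool = False

def completion_rate(items: list[ActionItem]) -> float:
    """Share of postmortem actions that have been closed."""
    if not items:
        return 1.0
    return sum(i.done for i in items) / len(items)

def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items past their due date, for review in operational forums."""
    return [i for i in items if not i.done and i.due < today]
```

Reviewing the `overdue(...)` output in a standing operational forum is one way to make the follow-through row above an agenda item rather than an aspiration.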
7. SEV1 response sequence (illustrative)
```mermaid
sequenceDiagram
    participant M as Monitoring
    participant P as Paging
    participant O as On-call
    participant W as War room
    participant C as Comms
    participant U as Users/stakeholders
    M->>P: Alert fires
    P->>O: Page
    O->>P: Acknowledge
    O->>W: Open bridge / channel
    O->>W: Mitigate (rollback, scale, fix)
    W->>C: Status for updates
    C->>U: External updates
    W->>O: Resolve / stabilize
    O->>U: Resolution notice
```
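The Acknowledge step above is typically enforced by an ack-timeout escalation policy: if the primary on-call does not acknowledge within a bound, the page moves down the chain. A minimal sketch of that loop, assuming hypothetical `page()` and `acked()` stubs; real paging tools such as PagerDuty or Opsgenie implement this natively:

```python
import time

# Hypothetical escalation chain: primary on-call, then secondary,
# then the engineering manager. Names and helpers are illustrative.
ESCALATION_CHAIN = ["oncall-primary", "oncall-secondary", "eng-manager"]
ACK_TIMEOUT_S = 300  # escalate if the page is not acked within 5 minutes

def page(target: str, alert: str) -> None:
    print(f"paging {target}: {alert}")  # stand-in for a real paging call

def acked(target: str) -> bool:
    ...  # stand-in: poll the paging system for an acknowledgement

def escalate(alert: str) -> str | None:
    """Walk the chain until someone acknowledges; return who did."""
    for target in ESCALATION_CHAIN:
        page(target, alert)
        deadline = time.monotonic() + ACK_TIMEOUT_S
        while time.monotonic() < deadline:
            if acked(target):
                return target
            time.sleep(5)
    return None  # nobody acked: page a fallback or open a SEV1 bridge
```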
8. Chaos engineering integration
| Activity | Goal |
| --- | --- |
| Game days | Rehearse incident roles and tooling with controlled scenarios |
| Fault injection | Validate detection, runbooks, and graceful degradation |
| Hypothesis-driven experiments | “If we kill X, latency stays within SLO”; ties to SRE error budgets |
Chaos is not random breakage in prod without safeguards; it follows steady-state hypotheses and blast-radius limits (see also sre-observability.md).
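To make the hypothesis concrete, here is a minimal sketch of a guarded experiment; `p99_latency_ms()`, `kill_instances()`, and `restore()` are illustrative stubs standing in for real telemetry and fault-injection hooks, and the SLO and blast-radius numbers are assumptions:

```python
# Hypothetical hypothesis-driven chaos experiment: verify steady state,
# inject a bounded fault, and check whether the SLO guardrail held.
SLO_P99_MS = 300     # steady-state hypothesis: p99 latency stays within SLO
BLAST_RADIUS = 0.05  # never touch more than 5% of the fleet

def p99_latency_ms() -> float: ...   # stub: read from monitoring
def kill_instances(fraction: float) -> None: ...  # stub: bounded fault injection
def restore() -> None: ...          # stub: roll the fault back

def run_experiment() -> bool:
    """Return True iff the steady-state hypothesis survived the fault."""
    if p99_latency_ms() > SLO_P99_MS:
        return False  # steady state already unhealthy: do not start
    kill_instances(BLAST_RADIUS)
    try:
        return p99_latency_ms() <= SLO_P99_MS  # hypothesis held?
    finally:
        restore()  # always roll back, pass or fail
```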
9. Metrics
| Metric | Definition | Use |
| --- | --- | --- |
| MTTA | Mean time to acknowledge an alert | On-call health, routing quality |
| MTTD | Mean time to detect an incident | Monitoring and SLO coverage |
| MTTR | Mean time to resolve / restore | Operational effectiveness |
| MTBF | Mean time between failures | Reliability engineering input |
| Incidents by severity | Count over a rolling window | Trend risk and guide investment |
| Postmortem completion rate | % of incidents with closed action items | Learning culture indicator |
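These metrics fall out of timestamps the paging and incident tooling already records. A minimal sketch of the arithmetic, assuming an illustrative `Incident` record (the field names are assumptions, not a ForgeSDLC schema):

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

# Illustrative incident record; timestamps come from the paging/incident tool.
@dataclass
class Incident:
    started: datetime       # failure begins (basis for MTTD)
    detected: datetime      # alert fires
    acknowledged: datetime  # on-call acks the page
    resolved: datetime      # service restored

def mtta(incidents: list[Incident]) -> float:
    """Mean time to acknowledge, in minutes."""
    return mean((i.acknowledged - i.detected).total_seconds() / 60 for i in incidents)

def mttd(incidents: list[Incident]) -> float:
    """Mean time to detect, in minutes."""
    return mean((i.detected - i.started).total_seconds() / 60 for i in incidents)

def mttr(incidents: list[Incident]) -> float:
    """Mean time to resolve, in minutes."""
    return mean((i.resolved - i.started).total_seconds() / 60 for i in incidents)
```

MTBF is derived from the gaps between successive `started` timestamps rather than from a single record, so it is omitted from the sketch.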
10. Anti-patterns
| Anti-pattern | Effect |
| --- | --- |
| Blame culture | Hides facts; failures repeat |
| Hero-dependent response | Bus-factor risk; inconsistent outcomes |
| No postmortems | Same outages recur |
| Alert fatigue | Real incidents get missed; see the observability guide for alert design |
11. Readiness checklist (before the pager fires)
| Area | Question |
| --- | --- |
| Runbooks | Is there a first-response doc for the top alert types? |
| Dashboards | Can on-call see golden signals and deploy correlation in one place? |
| Ownership | Is every critical-path service mapped to a team and an escalation path? |
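The checklist can be enforced mechanically if service metadata lives in a catalog. A minimal sketch, assuming a hypothetical catalog shape; run it in CI so ownership and runbook gaps block merges instead of surfacing mid-incident:

```python
# Hypothetical service-catalog check: fail fast if any critical-path
# service lacks an owner, an escalation path, or runbooks for its top alerts.
CATALOG = {
    "checkout-api": {
        "owner": "payments-team",
        "escalation": "payments-oncall",
        "runbooks": {"HighErrorRate": "https://wiki/runbooks/checkout-5xx"},
        "top_alerts": ["HighErrorRate"],
    },
}

def readiness_gaps(catalog: dict) -> list[str]:
    """Return a human-readable list of readiness gaps, empty if ready."""
    gaps = []
    for service, meta in catalog.items():
        if not meta.get("owner"):
            gaps.append(f"{service}: no owning team")
        if not meta.get("escalation"):
            gaps.append(f"{service}: no escalation path")
        for alert in meta.get("top_alerts", []):
            if alert not in meta.get("runbooks", {}):
                gaps.append(f"{service}: alert {alert} has no runbook")
    return gaps

assert readiness_gaps(CATALOG) == []  # wire into CI so gaps fail the build
```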