Post-Incident Review:
Incident ID
INC-YYYYMMDD-NNN
Severity
SEV1 / SEV2 / SEV3
Status
Draft / In Review / Final
Author
@your-handle
Reviewers
@reviewer1, @reviewer2
Incident Start
YYYY-MM-DD HH:MM UTC
Incident Resolved
YYYY-MM-DD HH:MM UTC
PIR Due
YYYY-MM-DD (≤ 5 business days post-resolution)
Detection Source
Alert / User report / Internal observation
Blameless principle: this document analyses systems and decisions, not people. Phrasing should describe what the system enabled, not who failed.
Incident Summary
Two or three sentences a stakeholder can read in 30 seconds: what broke, who was affected, how long, how it was resolved.
Impact
Users affected: <count or %>
Duration of impact:
SLO budget consumed: <% of monthly error budget>
Revenue / business impact:
Reputational / partner impact:
Data loss or corruption: <yes / no, scope>
Timeline
All times UTC. Include detection, escalation, mitigation, recovery, communications. Aim for granularity of every meaningful event.
14:02
Latency p95 alert fires on api-monitoring dashboard
Alertmanager
14:04
Oncall acknowledges, begins triage
@oncall
14:09
Identifies indexer pool exhaustion as proximate cause
@oncall
14:14
Rolls back indexer to previous image tag
@oncall
14:16
Latency recovers
—
14:20
Status page incident closed
@oncall
Root Cause
The technical root cause, plus the organisational / process root cause if different. Use the "Five Whys" if it helps surface deeper issues, but the writeup should be prose, not a literal numbered list.
Detection
How was the incident detected?
Time-to-detection (TTD) from first symptom to first alert/report.
Was monitoring sufficient? If not, what was missing?
Resolution
Immediate mitigation that stopped customer impact.
Permanent fix (link to PR if merged, RFC if scoped).
Did the rollback / mitigation introduce its own risk?
What Went Well
Tools, processes, decisions that worked. Document so we don't accidentally regress.
What Went Poorly
Where time was lost, where confusion arose, where tooling failed. Be specific.
Action Items
Each action: owner, due date, tracking issue. Action items must close — track in the same ticket system as features.
1
Add saturation alert on indexer DB pool
Prevention
@platform
2026-XX-XX
#NNNN
2
Document indexer rollback runbook
Mitigation
@docs
2026-XX-XX
#NNNN
3
Add chaos test for indexer pool exhaustion
Detection
@qa
2026-XX-XX
#NNNN
Action types:
Prevention — stops a recurrence.
Detection — surfaces it faster next time.
Mitigation — reduces blast radius if it happens again.
Process — improves how we respond.
Lessons Learned
The strategic takeaway, beyond the action items. What does this incident say about our priorities, architecture, or staffing? One or two paragraphs.
Supporting Evidence
Links to dashboards (with time range fixed)
Log excerpts (redacted if needed)
Slack thread permalinks
Affected PRs / commits
Last updated