
# Post-Incident Review: \<incident title>

| Field                 | Value                                          |
| --------------------- | ---------------------------------------------- |
| **Incident ID**       | INC-YYYYMMDD-NNN                               |
| **Severity**          | SEV1 / SEV2 / SEV3                             |
| **Status**            | Draft / In Review / Final                      |
| **Author**            | @your-handle                                   |
| **Reviewers**         | @reviewer1, @reviewer2                         |
| **Incident Start**    | YYYY-MM-DD HH:MM UTC                           |
| **Incident Resolved** | YYYY-MM-DD HH:MM UTC                           |
| **PIR Due**           | YYYY-MM-DD (≤ 5 business days post-resolution) |
| **Detection Source**  | Alert / User report / Internal observation     |

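The **PIR Due** field allows at most 5 business days after resolution. A minimal Python sketch of that calculation, assuming a Monday-to-Friday working week and no public holidays (the function name `pir_due_date` is illustrative, not part of any existing tooling):

```python
from datetime import date, timedelta


def pir_due_date(resolved: date, business_days: int = 5) -> date:
    """Return the date `business_days` working days after `resolved`.

    Assumption: weekends are Saturday and Sunday, and holidays are ignored.
    """
    due = resolved
    remaining = business_days
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return due
```

For an incident resolved on Friday 2026-01-09, `pir_due_date(date(2026, 1, 9))` gives 2026-01-16. Teams with regional holidays would need to extend the weekday check.
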
> **Blameless principle**: this document analyses systems and decisions, not people. Phrasing should describe what the system enabled, not who failed.

## Incident Summary

Two or three sentences a stakeholder can read in 30 seconds: what broke, who was affected, how long it lasted, and how it was resolved.

## Impact

* **Users affected**: \<count or %>
* **Duration of impact**: \<minutes or hours of user-visible impact>
* **SLO budget consumed**: \<% of monthly error budget>
* **Revenue / business impact**: \<estimate, or none known>
* **Reputational / partner impact**: \<description, or none known>
* **Data loss or corruption**: \<yes / no, scope>

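For a time-based availability SLO, **SLO budget consumed** can be estimated as impact minutes divided by the error budget in minutes. A rough sketch, assuming a 99.9% target, a 30-day window, and a full outage by default; every name and default here is an illustrative assumption, so substitute the service's actual SLO definition (request-based SLOs are calculated differently):

```python
def error_budget_consumed(
    impact_minutes: float,
    slo_target: float = 0.999,     # assumed availability target
    window_days: int = 30,         # assumed SLO window
    impact_fraction: float = 1.0,  # share of traffic affected (1.0 = full outage)
) -> float:
    """Return the percentage of the window's error budget used by the incident."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    budget_minutes = (1.0 - slo_target) * window_minutes
    return 100.0 * impact_minutes * impact_fraction / budget_minutes
```

Under these assumptions the budget is about 43.2 minutes, so a 14-minute full outage consumes roughly 32% of it.
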
## Timeline

All times UTC. Include detection, escalation, mitigation, recovery, and communications. Give every meaningful event its own row.

| Time  | Event                                                 | Actor        |
| ----- | ----------------------------------------------------- | ------------ |
| 14:02 | Latency p95 alert fires on `api-monitoring` dashboard | Alertmanager |
| 14:04 | On-call acknowledges, begins triage                   | @oncall      |
| 14:09 | Identifies indexer pool exhaustion as proximate cause | @oncall      |
| 14:14 | Rolls back indexer to previous image tag              | @oncall      |
| 14:16 | Latency recovers                                      | —            |
| 14:20 | Status page incident closed                           | @oncall      |

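Response intervals can be read straight off the timeline. The sketch below uses the sample rows from the table above and assumes all events fall on the same UTC day; the shorthand labels and the function name are illustrative. It does not compute time-to-detection, because the sample timeline has no first-symptom row.

```python
from datetime import datetime

# (time, label) pairs transcribed from the sample timeline; labels are shorthand.
TIMELINE = [
    ("14:02", "alert"),
    ("14:04", "acknowledged"),
    ("14:14", "mitigated"),
    ("14:16", "recovered"),
]


def minutes_since_first(timeline):
    """Map each label to the whole minutes elapsed since the first row.

    Assumption: every row is on the same day, so HH:MM values are comparable.
    """
    fmt = "%H:%M"
    start = datetime.strptime(timeline[0][0], fmt)
    return {
        label: int((datetime.strptime(time, fmt) - start).total_seconds() // 60)
        for time, label in timeline
    }
```

For the sample rows, `minutes_since_first(TIMELINE)` reports 2 minutes to acknowledge, 12 to mitigate, and 14 to recover. An incident that crosses midnight would need full `YYYY-MM-DD HH:MM` timestamps instead.
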
## Root Cause

The **technical** root cause, plus the **organisational / process** root cause if different. Use the "Five Whys" if it helps surface deeper issues, but the writeup should be prose, not a literal numbered list.

## Detection

* How was the incident detected?
* What was the time-to-detection (TTD), measured from first symptom to first alert or report?
* Was monitoring sufficient? If not, what was missing?

## Resolution

* Immediate mitigation that stopped customer impact.
* Permanent fix (link to PR if merged, RFC if scoped).
* Did the rollback / mitigation introduce its own risk?

## What Went Well

* Tools, processes, and decisions that worked. Record them so that later changes don't accidentally remove them.

## What Went Poorly

* Where time was lost, where confusion arose, where tooling failed. Be specific.

## Action Items

Every action item needs an owner, a due date, and a tracking issue. Action items must be closed, so track them in the same ticket system as feature work.

| # | Action                                     | Type       | Owner     | Due        | Issue |
| - | ------------------------------------------ | ---------- | --------- | ---------- | ----- |
| 1 | Add saturation alert on indexer DB pool    | Prevention | @platform | 2026-XX-XX | #NNNN |
| 2 | Document indexer rollback runbook          | Mitigation | @docs     | 2026-XX-XX | #NNNN |
| 3 | Add chaos test for indexer pool exhaustion | Detection  | @qa       | 2026-XX-XX | #NNNN |

Action types:

* **Prevention** — stops a recurrence.
* **Detection** — surfaces it faster next time.
* **Mitigation** — reduces blast radius if it happens again.
* **Process** — improves how we respond.

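The requirement that every action item has an owner, a due date, a tracking issue, and one of the four types can be checked mechanically. A minimal sketch, assuming action items are exported as simple records; the `ActionItem` class, its field names, and both helper functions are hypothetical and not tied to any particular ticket system:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List, Optional


class ActionType(Enum):
    PREVENTION = "Prevention"
    DETECTION = "Detection"
    MITIGATION = "Mitigation"
    PROCESS = "Process"


@dataclass
class ActionItem:
    action: str
    type: ActionType
    owner: Optional[str] = None
    due: Optional[date] = None
    issue: Optional[str] = None
    closed: bool = False


def missing_fields(item: ActionItem) -> List[str]:
    """Return the names of required fields that are still empty."""
    required = {"owner": item.owner, "due": item.due, "issue": item.issue}
    return [name for name, value in required.items() if not value]


def overdue(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Return open items whose due date has already passed."""
    return [
        item for item in items
        if not item.closed and item.due is not None and item.due < today
    ]
```

A check like this could run before a PIR moves from In Review to Final, rejecting any item for which `missing_fields` returns a non-empty list.
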
## Lessons Learned

The strategic takeaway, beyond the action items. What does this incident say about our priorities, architecture, or staffing? One or two paragraphs.

## Supporting Evidence

* Links to dashboards (with the time range pinned to the incident window)
* Log excerpts (redacted if needed)
* Slack thread permalinks
* Affected PRs / commits