
# Post-Incident Review: \<incident title>

|                       |                                                |
| --------------------- | ---------------------------------------------- |
| **Incident ID**       | INC-YYYYMMDD-NNN                               |
| **Severity**          | SEV1 / SEV2 / SEV3                             |
| **Status**            | Draft / In Review / Final                      |
| **Author**            | @your-handle                                   |
| **Reviewers**         | @reviewer1, @reviewer2                         |
| **Incident Start**    | YYYY-MM-DD HH:MM UTC                           |
| **Incident Resolved** | YYYY-MM-DD HH:MM UTC                           |
| **PIR Due**           | YYYY-MM-DD (≤ 5 business days post-resolution) |
| **Detection Source**  | Alert / User report / Internal observation     |
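
The PIR due date in the table above can be derived from the resolution timestamp. The following is a minimal sketch, assuming that "business days" means Monday to Friday and that public holidays are ignored; the function name `pir_due_date` and the sample timestamp are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone


def pir_due_date(resolved_at: datetime, business_days: int = 5) -> date:
    """Return the date `business_days` working days after resolution.

    Assumption: working days are Monday to Friday; holidays are not considered.
    `resolved_at` should be timezone-aware so the UTC conversion is unambiguous.
    """
    due = resolved_at.astimezone(timezone.utc).date()
    remaining = business_days
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return due


# Hypothetical incident resolved on Thursday 2026-03-12 at 14:16 UTC.
resolved = datetime(2026, 3, 12, 14, 16, tzinfo=timezone.utc)
print(pir_due_date(resolved))  # 2026-03-19
```

A team that observes holidays would need to extend the weekday check with its own calendar.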

> **Blameless principle**: this document analyses systems and decisions, not people. Phrasing should describe what the system enabled, not who failed.

## Incident Summary

Two or three sentences a stakeholder can read in 30 seconds: what broke, who was affected, for how long, and how it was resolved.

## Impact

* **Users affected**: \<count or %>
* **Duration of impact**:
* **SLO budget consumed**: \<% of monthly error budget>
* **Revenue / business impact**:
* **Reputational / partner impact**:
* **Data loss or corruption**: \<yes / no, scope>
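
One way to fill in the "SLO budget consumed" field is sketched below. It assumes a time-based availability SLO measured over a 30-day window; the 99.9% target, the function name, and its parameters are illustrative assumptions, not values defined by this template. Teams with request-based SLOs would count failed requests instead of minutes.

```python
def error_budget_consumed(
    impact_minutes: float,
    slo_target: float,
    window_days: int = 30,
    affected_fraction: float = 1.0,
) -> float:
    """Return the percentage of the window's error budget used by one incident.

    Assumptions: the SLO is time-based, and partial impact is weighted by the
    fraction of users affected (1.0 means everyone).
    """
    window_minutes = window_days * 24 * 60
    budget_minutes = window_minutes * (1.0 - slo_target)
    if budget_minutes <= 0:
        raise ValueError("slo_target must be below 1.0")
    return 100.0 * impact_minutes * affected_fraction / budget_minutes


# Hypothetical: 14 minutes of full impact against an assumed 99.9% target.
# The budget is 43.2 minutes per 30 days, so roughly a third is consumed.
print(round(error_budget_consumed(14, 0.999), 1))  # 32.4
```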

## Timeline

All times are in UTC. Include detection, escalation, mitigation, recovery, and communications. Give every meaningful event its own row.

| Time  | Event                                                 | Actor        |
| ----- | ----------------------------------------------------- | ------------ |
| 14:02 | Latency p95 alert fires on `api-monitoring` dashboard | Alertmanager |
| 14:04 | Oncall acknowledges, begins triage                    | @oncall      |
| 14:09 | Identifies indexer pool exhaustion as proximate cause | @oncall      |
| 14:14 | Rolls back indexer to previous image tag              | @oncall      |
| 14:16 | Latency recovers                                      | —            |
| 14:20 | Status page incident closed                           | @oncall      |
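
Response intervals can be read straight off a timeline like the one above. This is a small sketch using the sample rows; the dictionary keys and the helper `minutes_between` are hypothetical names, and the helper assumes all events fall on the same UTC day.

```python
from datetime import datetime

# Times copied from the sample timeline; the key names are illustrative.
timeline = {
    "alert_fired": "14:02",
    "acknowledged": "14:04",
    "rollback_started": "14:14",
    "recovered": "14:16",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from `start` to `end`, both given as HH:MM.

    Assumption: both times are on the same day, so no midnight rollover.
    """
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_ack = minutes_between(timeline["alert_fired"], timeline["acknowledged"])
time_to_mitigate = minutes_between(timeline["alert_fired"], timeline["rollback_started"])
time_to_recover = minutes_between(timeline["alert_fired"], timeline["recovered"])
print(time_to_ack, time_to_mitigate, time_to_recover)  # 2 12 14
```

For incidents that span midnight, full `YYYY-MM-DD HH:MM` timestamps would be needed instead of bare clock times.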

## Root Cause

The **technical** root cause, plus the **organisational / process** root cause if different. Use the "Five Whys" if it helps surface deeper issues, but the writeup should be prose, not a literal numbered list.

## Detection

* How was the incident detected?
* Time-to-detection (TTD) from first symptom to first alert/report.
* Was monitoring sufficient? If not, what was missing?

## Resolution

* Immediate mitigation that stopped customer impact.
* Permanent fix (link to PR if merged, RFC if scoped).
* Did the rollback / mitigation introduce its own risk?

## What Went Well

* Tools, processes, and decisions that worked. Document them so we don't accidentally regress.

## What Went Poorly

* Where time was lost, where confusion arose, where tooling failed. Be specific.

## Action Items

Each action needs an owner, a due date, and a tracking issue. Action items must be closed, so track them in the same ticket system as feature work.

| # | Action                                     | Type       | Owner     | Due        | Issue |
| - | ------------------------------------------ | ---------- | --------- | ---------- | ----- |
| 1 | Add saturation alert on indexer DB pool    | Prevention | @platform | 2026-XX-XX | #NNNN |
| 2 | Document indexer rollback runbook          | Mitigation | @docs     | 2026-XX-XX | #NNNN |
| 3 | Add chaos test for indexer pool exhaustion | Detection  | @qa       | 2026-XX-XX | #NNNN |

Action types:

* **Prevention** — stops a recurrence.
* **Detection** — surfaces it faster next time.
* **Mitigation** — reduces blast radius if it happens again.
* **Process** — improves how we respond.
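
The requirements above (owner, due date, tracking issue, and one of the four action types) lend themselves to an automated check before a PIR is marked Final. The following is a sketch only: the `ActionItem` class, the `validate` function, and the `@handle` and `#number` conventions it checks are assumptions drawn from the sample table, and the issue number `#1234` is made up for illustration.

```python
from dataclasses import dataclass
from datetime import date

ACTION_TYPES = {"Prevention", "Detection", "Mitigation", "Process"}


@dataclass
class ActionItem:
    action: str
    type: str
    owner: str
    due: str    # expected as YYYY-MM-DD
    issue: str  # expected as #<number>


def validate(item: ActionItem) -> list[str]:
    """Return a list of problems; an empty list means the item is complete."""
    problems = []
    if item.type not in ACTION_TYPES:
        problems.append(f"unknown type: {item.type!r}")
    if not (item.owner.startswith("@") and len(item.owner) > 1):
        problems.append("owner missing")
    try:
        date.fromisoformat(item.due)
    except ValueError:
        problems.append("due date is not a concrete YYYY-MM-DD date")
    if not (item.issue.startswith("#") and item.issue[1:].isdigit()):
        problems.append("tracking issue missing")
    return problems


# A row still carrying template placeholders is flagged twice.
draft = ActionItem("Document indexer rollback runbook", "Mitigation",
                   "@docs", "2026-XX-XX", "#NNNN")
print(validate(draft))
```

A filled-in row such as `ActionItem("Add saturation alert on indexer DB pool", "Prevention", "@platform", "2026-04-01", "#1234")` would return an empty list.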

## Lessons Learned

The strategic takeaway, beyond the action items. What does this incident say about our priorities, architecture, or staffing? One or two paragraphs.

## Supporting Evidence

* Links to dashboards (with time range fixed)
* Log excerpts (redacted if needed)
* Slack thread permalinks
* Affected PRs / commits

