Key & Secret Rotation Runbook
Operational procedures for routine and emergency rotation of secrets, credentials, and signing material across the Lithosphere stack.
Schedule and SLAs are defined in docs/guides/security-checklist.md. This runbook covers how, not when.
Pre-flight (every rotation)
Confirm an active maintenance window or that the secret can be rotated zero-downtime (most below can).
Notify #litho-oncall with START-ROTATION <secret-type> <env>.
Open a tracking issue using the Incident / Operational Change issue template (label: rotation).
Have a rollback plan ready: keep the previous secret active until the new one is verified in use.
Authority Matrix
API Keys (3rd party) | Oncall | Security Lead | -
DB Credentials | Oncall | Security Lead | DBA
TLS Certificates | Automated (ACM) | Security Lead | -
Cosign Signing Keys | Security Lead | CAB | Release Eng
SSH Keys (bastion) | Oncall | Security Lead | -
Root CA | CAB | CAB + CTO | Security Lead
GHCR / Registry Tokens | Release Eng | Security Lead | -
AWS OIDC Role Trust | Platform Eng | Security Lead | -
Routine Rotation — by Secret Type
API Keys (third-party services, 90 days)
Examples: WalletConnect / Reown project ID rotation, RPC provider keys.
Generate new key in the provider console. Do not revoke the old one yet.
Stage the new value in AWS Secrets Manager.
Roll the consuming service(s): docker compose up -d --force-recreate <service> on the indexer EC2 (or trigger deploy-simple.yaml if the secret is read at build time).
Verify: confirm the service emits no auth-error metrics for 15 minutes; check that the provider dashboard shows traffic against the new key.
Revoke the old key in the provider console.
Close the rotation issue with the new key's last-4 and rotation timestamp.
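The Secrets Manager staging step above could look roughly like the following. This is a minimal sketch, not the team's actual tooling: the function name stage_secret and the example secret id are hypothetical, and it assumes the AWS CLI is already authenticated against the target account. The value is read from a file so that it does not end up in shell history.

```shell
# stage_secret SECRET_ID VALUE_FILE
# Writes the contents of VALUE_FILE as a new version of SECRET_ID.
# With no --version-stages given, Secrets Manager labels the new version
# AWSCURRENT and moves the prior version to AWSPREVIOUS.
stage_secret() {
  aws secretsmanager put-secret-value \
    --secret-id "$1" \
    --secret-string "file://$2"
}

# Usage (hypothetical secret id and file name):
#   printf '%s' "$NEW_KEY" > ./new-key.txt   # printf avoids a trailing newline
#   stage_secret litho/prod/example-provider/api-key ./new-key.txt
```

The AWSPREVIOUS label only preserves the old value inside Secrets Manager. Rollback still depends on the old key remaining valid at the provider, which is why revocation comes last.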
Database Credentials (90 days)
Create a new IAM user / Postgres role with the same grants.
Update Secrets Manager (litho/prod/postgres/url).
Restart the litho-api and litho-indexer services. Both must reconnect cleanly.
Drop the old role only after 24h of successful operation.
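For the Postgres side of the create and drop steps above, one possible shape is sketched below. It rests on an assumption the runbook does not state: that grants live on a shared group role, so a new login role gets "the same grants" simply by being a member of it. All role names, the function names, and ADMIN_DATABASE_URL are hypothetical. Role names are passed as psql variables, and the password is set interactively so it never appears on a command line.

```shell
# Print SQL that creates the replacement login role as a member of the
# (assumed) group role that carries the application grants.
create_role_sql() {
  cat <<'SQL'
CREATE ROLE :"new_role" LOGIN IN ROLE :"group_role";
SQL
}

# Print SQL that retires the old role after the 24h soak period.
# REASSIGN OWNED / DROP OWNED act on the current database only, so this
# must be run in every database where the old role owns objects.
drop_role_sql() {
  cat <<'SQL'
REASSIGN OWNED BY :"old_role" TO :"new_role";
DROP OWNED BY :"old_role";
DROP ROLE :"old_role";
SQL
}

# Usage (hypothetical names):
#   create_role_sql | psql "$ADMIN_DATABASE_URL" -v ON_ERROR_STOP=1 \
#     -v new_role=litho_app_b -v group_role=litho_app
#   psql "$ADMIN_DATABASE_URL" -c '\password litho_app_b'
#   drop_role_sql | psql "$ADMIN_DATABASE_URL" -v ON_ERROR_STOP=1 \
#     -v old_role=litho_app_a -v new_role=litho_app_b
```

The SQL is fed on standard input rather than through -c because psql only interpolates variables such as :"new_role" in input it reads as a script.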
TLS Certificates (90 days, automated)
AWS ACM auto-renews 60 days before expiry. Manual intervention only if:
DNS validation drifts: re-issue via aws acm request-certificate ....
Nginx on Sentry 1 isn't picking up renewal: sudo nginx -s reload.
Verify:
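One way to check the certificate that an endpoint is actually serving is sketched below. It assumes openssl is installed on the machine running the check; the function names and the hostname in the usage lines are placeholders, not values from this runbook.

```shell
# cert_enddate HOST [PORT]
# Prints the notAfter date of the certificate HOST presents (default port 443).
cert_enddate() {
  openssl s_client -connect "$1:${2:-443}" -servername "$1" </dev/null 2>/dev/null \
    | openssl x509 -noout -enddate
}

# cert_valid_for HOST DAYS [PORT]
# Succeeds only if the served certificate is still valid DAYS days from now.
cert_valid_for() {
  openssl s_client -connect "$1:${3:-443}" -servername "$1" </dev/null 2>/dev/null \
    | openssl x509 -noout -checkend "$(( $2 * 86400 ))" >/dev/null
}

# Usage (placeholder hostname):
#   cert_enddate rpc.example.org
#   cert_valid_for rpc.example.org 30 || echo "certificate expires within 30 days"
```

Checking the live endpoint rather than the ACM console catches the case named above, where the renewal succeeded but Nginx is still serving the old certificate.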
Cosign Signing Keys (annual)
Note: We use Cosign keyless via OIDC in CI. Long-lived signing keys are reserved for offline signing scenarios. If/when we adopt offline keys, this section applies.
Generate a new keypair offline.
Store cosign.key in AWS Secrets Manager (litho/sign/cosign/private), upload cosign.pub to docs/security/ and tag in git.
Update publish-images.yaml to reference the new key id.
Re-sign the latest mainnet release with both old and new keys for a 30-day grace window.
After 30 days, remove the old public key from the verification policy.
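The dual-signing step above can be sketched with the cosign CLI as follows. This is illustrative only: the helper name sign_with_keys, the key paths, and the image reference are hypothetical, and since this section applies only if offline keys are adopted, nothing here reflects an existing pipeline.

```shell
# sign_with_keys IMAGE_REF KEY_FILE...
# Signs IMAGE_REF once per private key, so verifiers pinned to either the
# old or the new public key succeed during the grace window.
sign_with_keys() {
  image="$1"
  shift
  for key in "$@"; do
    cosign sign --key "$key" "$image" || return 1
  done
}

# Usage (hypothetical paths; reference the image by digest, not by tag):
#   cosign generate-key-pair    # offline machine; writes cosign.key and cosign.pub
#   sign_with_keys "ghcr.io/<org>/<image>@sha256:<digest>" old/cosign.key new/cosign.key
#   cosign verify --key new/cosign.pub "ghcr.io/<org>/<image>@sha256:<digest>"
```

Verifying against both public keys after signing confirms that the grace window works in both directions before the old key is scheduled for removal.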
SSH Keys — Bastion (annual)
Generate a new keypair locally.
Add the public key to the bastion via SSM (do NOT log into the bastion to do this; it must come from SSM in case the old key is compromised).
Verify connectivity from a new shell.
Remove the old public key.
Update the GitHub Actions secret BASTION_SSH_KEY (same procedure for INDEXER_SSH_KEY).
Re-run the latest successful deploy-simple.yaml to confirm CI still has access.
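The add and remove steps above could be driven through SSM Run Command along these lines. This is a sketch under several assumptions: the instance id, user, and file paths are placeholders, the helper names are invented for illustration, the authorized_keys file already exists on the bastion, and the public keys contain no double quotes or backslashes.

```shell
# ssm_run INSTANCE_ID COMMAND
# Runs one shell command on the instance via the AWS-RunShellScript document.
# COMMAND is embedded in JSON unescaped, so it must not contain double
# quotes or backslashes.
ssm_run() {
  aws ssm send-command \
    --instance-ids "$1" \
    --document-name "AWS-RunShellScript" \
    --parameters "{\"commands\":[\"$2\"]}"
}

# add_key_cmd PUBKEY_FILE AUTHORIZED_KEYS
# Prints a command that appends the new public key.
add_key_cmd() {
  printf "echo '%s' >> %s" "$(cat "$1")" "$2"
}

# remove_key_cmd PUBKEY_FILE AUTHORIZED_KEYS
# Prints a command that drops the old key, matched by its base64 blob
# (field 2 of the .pub file). Rewriting via cat keeps the file's owner and mode.
remove_key_cmd() {
  blob="$(cut -d' ' -f2 "$1")"
  printf "grep -vF '%s' %s > %s.new; cat %s.new > %s; rm %s.new" \
    "$blob" "$2" "$2" "$2" "$2" "$2"
}

# Usage (placeholder instance id and paths):
#   ssh-keygen -t ed25519 -f ./bastion_new -C "bastion-rotation"
#   ssm_run i-0123456789abcdef0 "$(add_key_cmd ./bastion_new.pub /home/ec2-user/.ssh/authorized_keys)"
#   ssm_run i-0123456789abcdef0 "$(remove_key_cmd ./bastion_old.pub /home/ec2-user/.ssh/authorized_keys)"
```

Matching the old key by its base64 blob rather than its comment avoids removing the wrong entry when two keys share a comment.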
Root CA (5 years)
Out of scope for this runbook — coordinated through CAB with a dedicated migration plan. Touchpoint here is procedural: ensure docs/governance/ carries the migration plan from the prior cycle, and that the new CA is published in docs/security/ with overlapping validity ≥ 12 months.
GHCR / Registry Tokens
Per-repo GHCR access uses the short-lived GITHUB_TOKEN via permissions: packages: write, so no manual rotation is needed. If a long-lived PAT was created in error, revoke it immediately at https://github.com/settings/tokens.
AWS OIDC Role Trust Policy
No secret to rotate, but review the trust policy on litho-mainnet-github-actions-deployer annually:
Confirm token.actions.githubusercontent.com is the only federated principal and that sub claim restricts to the expected repo and refs.
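A minimal way to pull the trust policy for that review is sketched below. The function name trust_policy is hypothetical; the call assumes AWS CLI credentials with iam:GetRole on the role.

```shell
# trust_policy ROLE_NAME
# Prints the role's trust (assume-role) policy document as JSON.
trust_policy() {
  aws iam get-role \
    --role-name "$1" \
    --query 'Role.AssumeRolePolicyDocument' \
    --output json
}

# Usage:
#   trust_policy litho-mainnet-github-actions-deployer
```

In the output, the federated principal appears under Principal.Federated as the ARN of the OIDC provider, and the repo and ref restriction appears under Condition on the key token.actions.githubusercontent.com:sub.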
Emergency Rotation
Triggered by suspected compromise. Follow the SLAs in security-checklist.md:
Critical | 1 hour | 4 hours
High | 4 hours | 24 hours
Medium | 24 hours | 72 hours
Procedure:
Page security-lead and oncall (PagerDuty: litho-security policy).
Revoke first, rotate second. Revocation is the SLA-critical step; deploying the new secret can take longer if rollback safety requires it.
Audit: pull CloudTrail / GitHub audit log for the last 30 days filtered by the compromised secret. Save to an incident folder in S3 (s3://litho-incidents/<date>-<short-desc>/).
Isolate blast radius: identify every workload that consumed the secret and confirm they're rotated or shut down before any external announcement.
PIR within 5 business days using pir-template.md.
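The CloudTrail half of the audit step might be scripted as below. This is a sketch with assumptions labeled: the function names are hypothetical, the right lookup attribute depends on what was compromised (for example AccessKeyId for a leaked AWS access key), and the date arithmetic in the usage lines relies on GNU date. The GitHub audit log is not covered here.

```shell
# audit_cloudtrail ATTR_KEY ATTR_VALUE START END
# Dumps CloudTrail management events matching one lookup attribute between
# START and END (ISO 8601 timestamps) as JSON on standard output.
audit_cloudtrail() {
  aws cloudtrail lookup-events \
    --lookup-attributes "AttributeKey=$1,AttributeValue=$2" \
    --start-time "$3" \
    --end-time "$4" \
    --output json
}

# incident_prefix DATE SHORT_DESC
# Prints the S3 incident folder in the layout this runbook specifies.
incident_prefix() {
  printf 's3://litho-incidents/%s-%s/' "$1" "$2"
}

# Usage (hypothetical date, description, and variable names):
#   start="$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ)"   # GNU date only
#   end="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
#   audit_cloudtrail AccessKeyId "$COMPROMISED_KEY_ID" "$start" "$end" > cloudtrail.json
#   aws s3 cp cloudtrail.json "$(incident_prefix 2025-01-31 leaked-rpc-key)"
```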
Quarterly Drill
Phase 10 acceptance criterion: "Quarterly security drill passes." Each quarter, run a tabletop rotation of one randomly-selected secret type without coordinating in advance with the wider team. Record results in the PIR template and file action items for any gaps surfaced.
Verification After Any Rotation
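For secrets held in AWS Secrets Manager, one check that fits any rotation is to confirm that a new version carries the AWSCURRENT label and that the previous version is still available for rollback. The sketch below is an assumption about what such a check could look like, not a recorded team procedure; the function name secret_versions is hypothetical.

```shell
# secret_versions SECRET_ID
# Prints when the secret last changed and which version ids hold which
# staging labels (AWSCURRENT, AWSPREVIOUS).
secret_versions() {
  aws secretsmanager describe-secret \
    --secret-id "$1" \
    --query '{changed: LastChangedDate, versions: VersionIdsToStages}' \
    --output json
}

# Usage:
#   secret_versions litho/prod/postgres/url
```

A changed timestamp that predates the rotation window suggests the new value was never staged, in which case consumers may still be running on the old secret.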