RFC 0001: Release Trains for Lithosphere services
Status: Draft, needs validator-team sign-off
Author(s): @bachal-abro
Sponsor: TBD (validator team lead)
Created: 2026-05-11
Last Updated: 2026-05-11
Discussion: Open a PR review of this file when ready
Supersedes: (none)
Summary
Today every push to main that touches Makalu/** rolls straight into mainnet through deploy-simple.yaml. There is no scheduled cadence and no separation between "feature work landed in main" and "feature work went live." This RFC proposes a three-track release calendar — weekly dev, bi-weekly staging, on-demand hotfix — that batches changes, gives the validator team a predictable promotion window, and makes post-incident review (PIR) easier because each train is a discrete unit.
Motivation
Status quo failure modes observed in the 2026-04 → 2026-05 window:
Surprise deploys. Pushing to main deploys immediately. The validator team has reported being paged for a deploy that nobody on shift knew was coming, typically because the change was a documentation tweak that rebuilt and force-recreated the api container as a side effect.
No batching. Independent fixes deploy one at a time, multiplying the change-failure rate (every push is a chance for the deploy step itself to fail). The 2026-05-11 push storm (commits 6d95249 → f0aff8f, 13 deploys in 6 hours) exercised the rollback path more than it needed to.
No coordination with chain ops. When chain ops needs a quiet window (e.g. the mtest-val-01 re-roll on 2026-05-08), there is no convention for pausing service deploys: the deploy gate is an admin toggle, not a scheduled lull.
PIR difficulty. Post-incident reviews currently span N commits across M hours with no narrative grouping. Trains give incident responders a single "what was in release 2026-05-13" reference.
Cost of inaction: deploy volume scales linearly with engineering velocity, and without a cadence the validator team has to treat all hours as "on-call equivalent" rather than carving out dedicated deploy windows.
Goals & Non-Goals
Goals
Define a written cadence for service releases that the validator team agrees to monitor and the dev infra team agrees to schedule against.
Provide a release-calendar template owners can fill in week-by-week.
Keep the existing zero-friction "push to main → deploy" mechanic available for hotfixes (don't slow down genuine emergencies).
Make the difference between "code merged" and "code deployed" observable — every train cuts a tag, every deploy maps back to a tag.
Non-Goals
Replacing the GitHub Environments approval gate from Phase 4 — trains build on top of that gate, not in place of it.
Defining new CI gates — the existing CI / publish-images / abi-sync / license-check / schema-sync set is sufficient.
CAB (Change Approval Board) approvals — covered separately under the Phase 11 governance umbrella; this RFC scopes to cadence.
Detailed Design
The three tracks
Track: dev
Cadence: Weekly, Tuesday 14:00 UTC
Trigger: manual workflow_dispatch tag
Monitored by: dev infra + early validators
Promotion: → staging after 7 d soak

Track: staging
Cadence: Bi-weekly, alternate Wednesdays 14:00 UTC
Trigger: manual workflow_dispatch tag
Monitored by: full validator set
Promotion: direct (no further promotion today; production = testnet)

Track: hotfix
Cadence: On-demand
Trigger: workflow_dispatch with release_type=hotfix
Monitored by: whoever paged
Promotion: direct deploy, no soak
Times are UTC and intentionally outside US-East trading hours but inside EU/AsiaPac business hours.
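To make the cadence concrete, the sketch below estimates when a change merged at a given time would reach a dev train and, after the 7-day soak, the earliest staging train. It is an illustration only: the function names are hypothetical, and the RFC does not say which Wednesdays are the "on" weeks, so the anchor date 2026-05-13 is an assumption.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
CUT_HOUR = 14               # both scheduled tracks cut at 14:00 UTC
TUESDAY = 1                 # datetime.weekday() numbering, Monday = 0
SOAK = timedelta(days=7)    # dev -> staging soak period
# ASSUMPTION: hypothetical first staging Wednesday; the RFC does not fix one.
STAGING_ANCHOR = datetime(2026, 5, 13, CUT_HOUR, tzinfo=UTC)


def next_dev_cut(after: datetime) -> datetime:
    """Return the first Tuesday 14:00 UTC strictly later than `after` (timezone-aware)."""
    candidate = after.astimezone(UTC).replace(
        hour=CUT_HOUR, minute=0, second=0, microsecond=0
    )
    while candidate.weekday() != TUESDAY or candidate <= after:
        candidate += timedelta(days=1)
    return candidate


def next_staging_cut(not_before: datetime) -> datetime:
    """Return the first alternate-Wednesday slot at or after `not_before`.

    Slots earlier than STAGING_ANCHOR are not modelled; the anchor is returned instead.
    """
    candidate = STAGING_ANCHOR
    while candidate < not_before:
        candidate += timedelta(days=14)
    return candidate


def earliest_trains(merged_at: datetime) -> tuple[datetime, datetime]:
    """Return (dev cut, earliest staging cut after the soak) for a merge time."""
    dev = next_dev_cut(merged_at)
    return dev, next_staging_cut(dev + SOAK)
```

Under these assumptions, a fix merged on Wednesday 2026-05-13 at 15:00 UTC would ride the Tuesday 2026-05-19 dev train and could be promoted no earlier than the Wednesday 2026-05-27 staging train.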
Tag scheme
v0.5.0-dev.3: third dev train of the v0.5 line
v0.5.0-staging.1: promotion of v0.5.0-dev.N to staging
v0.5.0-hotfix.1: emergency patch off the current staging tag
The tag is what publish-images.yaml already keys off (the type=sha,prefix=sha- and type=semver patterns from the existing docker/metadata-action config). No workflow code change is needed to support semver tags: the tag triggers the existing publish + Cosign sign + SBOM + Trivy gate.
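The tag scheme can be checked mechanically before a train is cut. The following is a minimal sketch, not part of any existing workflow: the regular expression is inferred from the three example tags in this section, and the helper names (parse_train_tag, next_train_tag) are hypothetical.

```python
import re
from typing import Iterable, NamedTuple

# Inferred from the examples: v<major>.<minor>.<patch>-<track>.<sequence>
TRAIN_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)-(dev|staging|hotfix)\.(\d+)$")


class TrainTag(NamedTuple):
    major: int
    minor: int
    patch: int
    track: str
    seq: int


def parse_train_tag(tag: str) -> TrainTag:
    """Split a train tag into its parts, or raise ValueError if it does not match."""
    match = TRAIN_TAG.match(tag)
    if match is None:
        raise ValueError(f"not a train tag: {tag!r}")
    major, minor, patch, track, seq = match.groups()
    return TrainTag(int(major), int(minor), int(patch), track, int(seq))


def next_train_tag(existing: Iterable[str], version: str, track: str) -> str:
    """Return the next tag on `track` for `version` (e.g. "0.5.0").

    Tags that do not follow the scheme are ignored. Sequences start at 1,
    matching "v0.5.0-dev.3" being described as the third dev train.
    """
    seqs = []
    for tag in existing:
        match = TRAIN_TAG.match(tag)
        if match is None:
            continue
        parsed = parse_train_tag(tag)
        if f"{parsed.major}.{parsed.minor}.{parsed.patch}" == version and parsed.track == track:
            seqs.append(parsed.seq)
    return f"v{version}-{track}.{max(seqs, default=0) + 1}"
```

For example, with v0.5.0-dev.1, v0.5.0-dev.2 and v0.5.0-staging.1 already present, the next dev tag would be v0.5.0-dev.3 and the first hotfix tag would be v0.5.0-hotfix.1.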
Calendar artifact
A new docs/governance/release-calendar.md (sibling to this RFC) lists the next four trains plus their on-call assignee. Updated weekly during the dev cut.
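As a rough illustration of what the weekly calendar update could produce, the sketch below lists the next four scheduled trains and assigns on-call names from a rotation. The row format, the rotation names, and the staging anchor date are all assumptions; the RFC does not define the calendar layout.

```python
from datetime import datetime, timedelta, timezone
from itertools import cycle

UTC = timezone.utc
# ASSUMPTION: hypothetical first staging Wednesday used to decide "alternate" weeks.
STAGING_ANCHOR = datetime(2026, 5, 13, 14, tzinfo=UTC)


def upcoming_trains(now: datetime, count: int = 4) -> list[tuple[str, datetime]]:
    """Return the next `count` scheduled (track, cut time) slots after `now`.

    `now` must be timezone-aware. Hotfixes are on-demand and never listed here.
    """
    slots = []
    day = now.astimezone(UTC).replace(hour=14, minute=0, second=0, microsecond=0)
    while len(slots) < count:
        if day > now:
            if day.weekday() == 1:  # Tuesday: weekly dev train
                slots.append(("dev", day))
            elif day.weekday() == 2 and (day - STAGING_ANCHOR).days % 14 == 0:
                slots.append(("staging", day))  # every second Wednesday
        day += timedelta(days=1)
    return slots


def render_calendar(slots, rotation):
    """Format one text row per slot, cycling through the on-call rotation."""
    assignees = cycle(rotation)
    return [
        f"{when:%Y-%m-%d %H:%M} UTC  {track:<8} on-call: {next(assignees)}"
        for track, when in slots
    ]
```

Calling render_calendar(upcoming_trains(now), ["oncall-a", "oncall-b"]) with placeholder assignees gives four rows that can be pasted into the calendar file and corrected by hand.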
Hotfix path
Unchanged from today's workflow_dispatch on deploy-simple.yaml. The RFC accepts that emergency response trumps cadence; the calendar simply records hotfixes after the fact for the PIR.
A hotfix that succeeds becomes the basis for the next staging tag — i.e. hotfixes don't rebase out, they roll forward into the train.
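One way to verify the roll-forward rule before cutting a staging tag is to confirm that the hotfix commit is an ancestor of the candidate ref. The sketch below wraps git merge-base --is-ancestor, which exits 0 when the first commit is an ancestor of the second and 1 when it is not. The function names and the injectable runner are hypothetical; nothing like this exists in the current workflows.

```python
import subprocess
from typing import Callable, Sequence


def git_exit_code(args: Sequence[str]) -> int:
    """Run `git <args>` in the current repository and return its exit code."""
    return subprocess.run(["git", *args], check=False).returncode


def contains_hotfix(
    hotfix_tag: str,
    candidate_ref: str,
    run: Callable[[Sequence[str]], int] = git_exit_code,
) -> bool:
    """Return True if `hotfix_tag` is an ancestor of `candidate_ref`.

    `run` is injectable so the check can be exercised without a repository.
    """
    code = run(["merge-base", "--is-ancestor", hotfix_tag, candidate_ref])
    if code == 0:
        return True   # hotfix rolled forward into the candidate
    if code == 1:
        return False  # candidate would drop the hotfix
    raise RuntimeError(f"git merge-base failed with exit code {code}")
```

A cut script could call contains_hotfix("v0.5.0-hotfix.1", "main") and refuse to tag the next staging train if it returns False.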
Architecture
No new components. The release-train mechanism is entirely a tag-naming convention + a markdown calendar + a written sign-off cadence. All existing CI/CD flows are reused.
Data Model / Schema Changes
None.
API / Interface Changes
None for runtime APIs. The SDK's release process (release-process.md) is unaffected — it follows semver on tag pushes, which already aligns with the train tag scheme.
Operational Considerations
Deploy concurrency:
deploy-simple.yaml already has concurrency: deploy-${{ env }} with cancel-in-progress: false. Trains land sequentially.
Observability: the Phase 9 build_info Prometheus gauge already surfaces the deployed sha + version; with the new tag scheme it will display version=v0.5.0-staging.1 instead of sha-abc1234. That's the operator-friendly format.
Calendar drift: if the assigned on-call cannot cut the train, the RFC's escalation rule is to skip: never run an unmonitored train. Better an empty week than a deploy nobody's watching.
Security & Privacy
No new attack surface. Tag pushes are already gated by the Required Reviewers protection rule on the mainnet GitHub Environment (see deployment-approvals.md). Trains don't bypass that gate.
Alternatives Considered
Continuous deploy with feature flags. Rejected: no feature-flag infrastructure exists today; building it is a separate multi-week project. Trains buy similar coordination benefit at near-zero cost.
Daily releases. Rejected: too noisy for a validator team that doesn't operate 24/7 yet; the savings from batching are real.
One unified weekly train (no dev/staging split). Rejected: the dev/staging split is what makes the dev-vs-validator coordination interface explicit. With one track the dev team can't run a soak window without bothering the validator team about each tag.
Use release-please automation to manage tags. Considered for follow-up. This RFC keeps tags manual to start so the cadence and human review pattern stabilize first; automation is a v2 concern.
Drawbacks
Slower mean-time-to-prod for non-urgent changes. A bug fix landed on Wednesday waits until next Tuesday's dev train. Mitigated by the hotfix path for anything genuinely urgent.
Coordination overhead. Someone has to keep the calendar accurate and rotate the on-call. ~30 min/week.
Calendar can be ignored. If git push origin main still deploys, the train tags are optional. We'd need to either (a) flip the per-push deploy off and require a tag, or (b) trust the team to honour the convention. This RFC defers that decision to validator-team sign-off; the proposed default is (b) for the first month, then revisit.
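The difference between options (a) and (b) can be expressed as a single switch in a deploy gate. This is a sketch under assumptions: deploy_allowed and enforce_tags are hypothetical names, the ref strings follow Git's usual refs/heads/ and refs/tags/ layout, and the hotfix exemption reflects the goal of keeping emergencies friction-free.

```python
import re

# Refs that correspond to a train tag such as refs/tags/v0.5.0-dev.3
TRAIN_REF = re.compile(r"^refs/tags/v\d+\.\d+\.\d+-(dev|staging|hotfix)\.\d+$")


def deploy_allowed(ref: str, enforce_tags: bool, release_type: str = "") -> bool:
    """Decide whether a deploy triggered from `ref` should proceed.

    enforce_tags=False models option (b): per-push deploys still run.
    enforce_tags=True models option (a): only train tags deploy.
    A release_type of "hotfix" is always allowed.
    """
    if release_type == "hotfix":
        return True
    if TRAIN_REF.match(ref):
        return True
    return not enforce_tags
```

In advisory mode (enforce_tags=False) a push to refs/heads/main still deploys; flipping the flag would block that push while leaving train tags and hotfixes unaffected.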
Rollout Plan
2026-05-12 — RFC reviewed by validator team. Sign-off captured in a comment on the PR or this file.
2026-05-13 — Calendar artifact published with the next four trains scheduled. On-call assignees filled in.
2026-05-13 → 2026-06-13 — Run cadence in advisory mode. Per-push deploys remain enabled; trains are additional. Observe friction points.
2026-06-13 — Retro: decide whether to (a) keep both flows, (b) tighten per-push to non-Makalu paths only and require tags for Makalu changes, or (c) hold the line and revisit in another 30 days.
Unresolved Questions
Success Metrics
Measured 30 days after sign-off:
Goal: Predictable cadence
Metric: Tuesday/Wed deploys as % of all deploys
Target: ≥ 70%

Goal: Reduced surprise deploys
Metric: Validator-team pages for unscheduled deploys
Target: 0 (excluding hotfixes)

Goal: Coordinated chain ops
Metric: Number of train postponements due to chain ops conflicts
Target: recorded, baseline target ≤ 1/month

Goal: PIR usability
Metric: PIRs that cite a train tag (rather than a commit range)
Target: 100%
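The predictable-cadence metric is simple arithmetic over deploy timestamps. The sketch below shows one way to compute it; the function name and the sample timestamps are made up for illustration, and it counts every Tuesday and Wednesday (as the metric is worded) rather than only the Wednesdays that carry a staging train.

```python
from datetime import datetime, timezone

TRAIN_WEEKDAYS = (1, 2)  # Tuesday and Wednesday in datetime.weekday() numbering


def scheduled_share(deploys: list[datetime]) -> float:
    """Return the percentage of deploys that happened on a Tuesday or Wednesday (UTC)."""
    if not deploys:
        return 0.0
    on_train_days = sum(
        1 for d in deploys if d.astimezone(timezone.utc).weekday() in TRAIN_WEEKDAYS
    )
    return 100.0 * on_train_days / len(deploys)


# Made-up sample: Tue 05-12, Wed 05-13, Thu 05-14, Tue 05-19
sample = [
    datetime(2026, 5, 12, 14, tzinfo=timezone.utc),
    datetime(2026, 5, 13, 14, tzinfo=timezone.utc),
    datetime(2026, 5, 14, 9, tzinfo=timezone.utc),
    datetime(2026, 5, 19, 14, tzinfo=timezone.utc),
]
share = scheduled_share(sample)
```

With this sample, three of four deploys fall on train days, so the share is 75%, which would meet the ≥ 70% target.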
References
deployment-approvals.md: the protection-rules layer this RFC builds on
release-process.md: the npm release flow (SDK), unaffected
pir-template.md: train tags slot in cleanly to the PIR "What changed" section
.github/workflows/deploy-simple.yaml: already supports workflow_dispatch with environment input
.github/workflows/publish-images.yaml: already keys image tagging off semver via docker/metadata-action