Observability & Correlation Runbook
Three layers of correlation tie an alert, log line, or user-reported request back to the commit that produced the running binary:
Build: every image carries the git SHA, build time, and version as runtime env vars and OCI labels. /version and a Prometheus build_info gauge surface them.
Request: the api stamps every incoming HTTP request with an X-Request-Id (caller-provided or generated UUID), stored in an AsyncLocalStorage so every log line in the request scope carries it. fetchWithRequestId propagates it to downstream services.
Span: @opentelemetry/sdk-node is wired into both api and indexer. Spans are emitted when OTEL_EXPORTER_OTLP_ENDPOINT is set; otherwise the SDK no-ops with zero runtime cost.
This document is the operational guide. Read top-to-bottom the first time; skip to § Where to look first when an incident lands.
Where to look first
"Production is broken — which commit?"
curl https://makalu.litho.ai/api/version
# { "service": "lithosphere-api",
#   "gitSha": "ebc449d...", "buildTime": "2026-05-11T19:42:13Z",
#   "version": "sha-ebc449d", "nodeVersion": "v20.x", "uptimeSec": 84 }
gitSha is the answer. Cross-reference with git log to see exactly what landed since the last known-good /version snapshot.
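The cross-referencing step can be scripted. The sketch below is a hypothetical helper, not something the repo is known to ship: it takes two /version payloads (the last known-good snapshot and the current one) and builds the git log range to review. The VersionInfo type only mirrors the fields shown in the example response; the function name and the sample SHAs are assumptions.

```typescript
// Shape of the /version payload, limited to fields shown in the example response.
interface VersionInfo {
  service: string;
  gitSha: string;
  buildTime: string;
  version: string;
}

// Hypothetical helper: returns the `git log` command that lists commits which
// landed after the last known-good build, up to and including the current one.
// Returns null when both snapshots point at the same commit.
function commitRangeCommand(lastGood: VersionInfo, current: VersionInfo): string | null {
  if (lastGood.gitSha === current.gitSha) {
    return null; // same build: nothing new was deployed
  }
  return `git log --oneline ${lastGood.gitSha}..${current.gitSha}`;
}

// Sample snapshots with made-up SHAs, for illustration only.
const lastGoodSnapshot: VersionInfo = {
  service: "lithosphere-api",
  gitSha: "1111111",
  buildTime: "2026-05-10T08:00:00Z",
  version: "sha-1111111",
};
const currentSnapshot: VersionInfo = {
  service: "lithosphere-api",
  gitSha: "2222222",
  buildTime: "2026-05-11T19:42:13Z",
  version: "sha-2222222",
};

const reviewCommand = commitRangeCommand(lastGoodSnapshot, currentSnapshot);
console.log(reviewCommand ?? "no new commits since the last known-good build");
```

Saving the /version response after each healthy deploy gives the "last known-good" input for free.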
"Which version is each service running?"
Open Grafana → Build Info panel on the overview dashboard. PromQL:
litho_api_build_info{} or litho_indexer_build_info{}
The gauge value is always 1; the labels (git_sha, build_time, version, node_version) carry the data. If api and indexer show different git_sha, you have a partial deploy; investigate the failing service.
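The partial-deploy check is a plain comparison of label values. As a minimal sketch, assuming the label sets have already been fetched from Prometheus (the fetch itself is not shown, and the helper names and sample values are hypothetical):

```typescript
// Labels carried by a *_build_info sample; the gauge value itself is always 1.
interface BuildInfoLabels {
  git_sha: string;
  build_time: string;
  version: string;
  node_version: string;
}

// Groups service names by the git_sha they report.
function groupByGitSha(services: Record<string, BuildInfoLabels>): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const [service, labels] of Object.entries(services)) {
    const members = groups.get(labels.git_sha) ?? [];
    members.push(service);
    groups.set(labels.git_sha, members);
  }
  return groups;
}

// More than one distinct git_sha across services means a partial deploy.
function isPartialDeploy(services: Record<string, BuildInfoLabels>): boolean {
  return groupByGitSha(services).size > 1;
}

// Made-up label values, for illustration only.
const observedBuilds: Record<string, BuildInfoLabels> = {
  api: { git_sha: "2222222", build_time: "2026-05-11T19:42:13Z", version: "sha-2222222", node_version: "v20.x" },
  indexer: { git_sha: "1111111", build_time: "2026-05-10T08:00:00Z", version: "sha-1111111", node_version: "v20.x" },
};

console.log(isPartialDeploy(observedBuilds) ? "partial deploy" : "all services on one commit");
```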
"User reported a 500. What happened?"
If their bug report includes the X-Request-Id value, filter the api logs by that requestId.
You get every log line for that request, including db queries (when those sites are migrated to logger) and downstream fetchWithRequestId calls (which propagate the same id to the indexer / Cosmos RPC).
If they did not include the header but you have an approximate timestamp, filter by url and time range first to narrow to candidates, then pull requestId from the matching request_completed log to grab the rest.
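Both lookups can be sketched against a dump of JSON log lines. This is an illustration rather than the team's actual query: it assumes pino's default fields (time as epoch milliseconds, msg), and it assumes that request_completed is the msg value and that the request path is logged under url. Adjust the field names to whatever the logger really emits.

```typescript
// Minimal view of one JSON log line. Field names beyond requestId are assumptions.
interface LogLine {
  time: number;
  msg: string;
  requestId?: string;
  url?: string;
  [key: string]: unknown;
}

// Parses newline-delimited JSON, skipping lines that are not JSON objects
// (for example output from console.log call sites that are not yet migrated).
function parseLogLines(ndjson: string): LogLine[] {
  const lines: LogLine[] = [];
  for (const raw of ndjson.split("\n")) {
    const text = raw.trim();
    if (!text) continue;
    try {
      const parsed: unknown = JSON.parse(text);
      if (typeof parsed === "object" && parsed !== null) {
        lines.push(parsed as LogLine);
      }
    } catch {
      // not JSON: ignore
    }
  }
  return lines;
}

// Case 1: the reporter supplied the request id.
function linesForRequest(lines: LogLine[], requestId: string): LogLine[] {
  return lines.filter((line) => line.requestId === requestId);
}

// Case 2: only a URL and a rough time window are known. Returns candidate ids.
function candidateRequestIds(lines: LogLine[], url: string, fromMs: number, toMs: number): string[] {
  const ids: string[] = [];
  for (const line of lines) {
    const inWindow = line.time >= fromMs && line.time <= toMs;
    if (line.msg === "request_completed" && line.url === url && inWindow && typeof line.requestId === "string") {
      ids.push(line.requestId);
    }
  }
  return Array.from(new Set(ids));
}

// Made-up log dump, for illustration only.
const sampleLogs = [
  '{"time":1000,"msg":"request_started","requestId":"req-a","url":"/api/blocks/1"}',
  '{"time":1050,"msg":"request_completed","requestId":"req-a","url":"/api/blocks/1","durationMs":50}',
  "plain console.log output",
  '{"time":9000,"msg":"request_completed","requestId":"req-b","url":"/api/blocks/1","durationMs":12}',
].join("\n");

const parsedLogs = parseLogLines(sampleLogs);
const candidates = candidateRequestIds(parsedLogs, "/api/blocks/1", 900, 2000);
const fullTrail = candidates.length > 0 ? linesForRequest(parsedLogs, candidates[0]) : [];
console.log(candidates, fullTrail.length);
```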
"A particular request was slow — what did it spend time on?"
If OTEL_EXPORTER_OTLP_ENDPOINT is set and a collector is receiving spans, open Jaeger/Tempo and filter by the same requestId (carried as a span attribute by the OTel http auto-instrumentation). You'll see the full request waterfall: middleware → handler → db → outbound HTTP.
Without a collector deployed: this view is unavailable. Use the durationMs field on the request_completed log line for a coarse number and isolate deeper by reading the handler code.
The three layers, in detail
1. Build metadata
Injected at: docker build time by publish-images.yaml (build-args GIT_SHA=${{ github.sha }}, BUILD_TIME, VERSION) and by deploy-simple.yaml (git rev-parse HEAD of the cloned source).
Surfaced at:
GET /version on api (:4000) and indexer (:3001): JSON with gitSha, buildTime, version, nodeVersion, uptimeSec.
GET /health on api: the version field now reads from the env (was hardcoded '1.0.0' pre-Phase-9).
Prometheus: litho_api_build_info / litho_indexer_build_info gauges.
OCI labels on the image: org.opencontainers.image.revision, .created, .version, .source, .title. Visible via docker inspect and on the GHCR package page.
Read it from code: readBuildInfo() in Makalu/{api,indexer}/src/lib/build-info.ts. Both packages keep an independent copy — they ship as independent Docker images with no shared workspace dep.
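For orientation, a reader of this kind might look roughly like the sketch below. It is not the contents of build-info.ts: the runtime env var names (assumed here to match the GIT_SHA, BUILD_TIME and VERSION build-args), the "unknown" and "dev" fallbacks, and the sha-prefixed version derivation are all assumptions. The env is passed in explicitly so the function stays easy to test; in a service you would pass process.env.

```typescript
interface BuildInfo {
  gitSha: string;
  buildTime: string;
  version: string;
}

type Env = Record<string, string | undefined>;

// Treats unset and empty values the same, since an unset Docker build-arg
// typically surfaces as an empty string.
function nonEmpty(value: string | undefined): string | undefined {
  const trimmed = value?.trim();
  return trimmed ? trimmed : undefined;
}

// Hypothetical reader: env var names and fallback values are assumptions.
function readBuildInfo(env: Env): BuildInfo {
  const gitSha = nonEmpty(env.GIT_SHA) ?? "unknown";
  const derivedVersion = gitSha === "unknown" ? "dev" : `sha-${gitSha.slice(0, 7)}`;
  return {
    gitSha,
    buildTime: nonEmpty(env.BUILD_TIME) ?? "unknown",
    version: nonEmpty(env.VERSION) ?? derivedVersion,
  };
}

// Made-up values, for illustration only.
const builtImage = readBuildInfo({ GIT_SHA: "2222222abcdef", BUILD_TIME: "2026-05-11T19:42:13Z" });
const localDev = readBuildInfo({});
console.log(builtImage, localDev);
```

Falling back to placeholder values instead of throwing keeps /version and /health usable in local development, where none of the build-args are set.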
2. Request IDs
Generated at: the api's request-ID middleware in Makalu/api/src/index.ts. Reads X-Request-Id from the incoming request; falls back to crypto.randomUUID().
Propagated by:
Pino mixin(): every logger.info(...) / logger.error(...) inside a request scope automatically includes requestId in its JSON output.
fetchWithRequestId() from Makalu/api/src/lib/logger.ts: wraps global fetch, injects the current requestId as X-Request-Id on outgoing calls. Use this instead of bare fetch in route handlers when you want the trace to cross service boundaries.
Response header: every response echoes X-Request-Id, so clients can include it in bug reports without server-side log digging.
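The mechanism behind all three propagation paths is a single AsyncLocalStorage. The sketch below shows that pattern in isolation using only Node's standard library; it is not the code in index.ts or logger.ts, and the helper names (withRequestId, logMixin, outgoingHeaders) are hypothetical.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

// Holds the request id for everything that runs inside one request scope.
const requestIdStore = new AsyncLocalStorage<string>();

// Uses the caller-provided X-Request-Id when present, else generates a UUID.
function resolveRequestId(headerValue: string | undefined): string {
  const trimmed = headerValue?.trim();
  return trimmed ? trimmed : randomUUID();
}

// What a middleware would do: run the rest of the request inside the store.
function withRequestId<T>(headerValue: string | undefined, work: () => T): T {
  return requestIdStore.run(resolveRequestId(headerValue), work);
}

// What a logger mixin would return: extra fields merged into every log line.
function logMixin(): { requestId?: string } {
  const requestId = requestIdStore.getStore();
  return requestId ? { requestId } : {};
}

// What a fetch wrapper would do: add the current id to outgoing headers.
function outgoingHeaders(base: Record<string, string> = {}): Record<string, string> {
  const requestId = requestIdStore.getStore();
  return requestId ? { ...base, "X-Request-Id": requestId } : { ...base };
}

// Inside a scope, both the log fields and the outgoing headers carry the id.
const insideScope = withRequestId("client-supplied-id", () => ({
  log: logMixin(),
  headers: outgoingHeaders({ accept: "application/json" }),
}));

// Outside any scope nothing is attached.
const outsideScope = { log: logMixin(), headers: outgoingHeaders() };
console.log(insideScope, outsideScope);
```

Because the store follows async continuations, code deep inside a handler does not need the id passed down as an argument.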
Indexer side: not request-driven. Each polling iteration generates a cycleId (also a UUID) and wraps its work in cycleIdStore.run(...). Logs from a single cycle group together by cycleId. fetchWithCycleId() is the indexer's equivalent of fetchWithRequestId for outgoing RPC calls.
Migration status: the highest-traffic logs (request lifecycle, indexer cycle start/error/fatal, server startup) are switched to pino. Helper-level console.log calls in routes.ts, litho.ts, and mappings.ts are intentionally left for a follow-up — converting them is mechanical and better done in small reviewable batches.
3. OpenTelemetry tracing
Wired at: Makalu/api/src/tracing.ts and Makalu/indexer/src/tracing.ts, each imported as a side effect at the very top of index.ts / mappings.ts so instrumentation is registered before any other module loads.
Enabled by: setting OTEL_EXPORTER_OTLP_ENDPOINT to your collector URL. Example values:
http://localhost:4318 (local OTel collector; see § Running locally)
http://tempo.litho.internal:4318 (production; hypothetical, not yet deployed)
When unset (production today), the entire OTel stack is dormant — no spans, no exporter connections, no perf cost. The SDK packages are still installed so the moment a collector lands, no service-code change is needed.
Auto-instrumentations enabled: express, http, pg, fetch. Disabled: fs, dns (too noisy).
Resource attributes attached to every span: service.name, service.version, litho.git_sha, litho.build_time. So a slow-trace investigation in Jaeger/Tempo can directly identify the build that produced the trace, without round-tripping through /version.
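The env gate and the resource attributes can be pictured as one small decision function. The sketch below is only that decision, written without the OpenTelemetry packages so it runs standalone; it is not the contents of tracing.ts. The function name, the env var names other than OTEL_EXPORTER_OTLP_ENDPOINT, and the "unknown" fallbacks are assumptions.

```typescript
type Env = Record<string, string | undefined>;

interface TracingConfig {
  endpoint: string;
  resourceAttributes: Record<string, string>;
}

// Returns null when no collector endpoint is configured, signalling that the
// caller should skip starting the SDK altogether.
function tracingConfigFromEnv(env: Env, serviceName: string): TracingConfig | null {
  const endpoint = env.OTEL_EXPORTER_OTLP_ENDPOINT?.trim();
  if (!endpoint) {
    return null; // unset or empty: tracing stays dormant
  }
  return {
    endpoint,
    resourceAttributes: {
      "service.name": serviceName,
      "service.version": env.VERSION ?? "unknown",
      "litho.git_sha": env.GIT_SHA ?? "unknown",
      "litho.build_time": env.BUILD_TIME ?? "unknown",
    },
  };
}

const dormant = tracingConfigFromEnv({}, "lithosphere-api");
const enabled = tracingConfigFromEnv(
  { OTEL_EXPORTER_OTLP_ENDPOINT: "http://localhost:4318", GIT_SHA: "2222222", VERSION: "sha-2222222" },
  "lithosphere-api",
);
console.log(dormant, enabled);
```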
Running locally
The collector logs (terminal 1) will show the exported spans with the matching request_id / litho.git_sha attributes.
For a fuller stack, see the OTel docs on piping the collector to Jaeger or Tempo.
SLO dashboard
Grafana panel "Lithosphere — SLO" (uid lithosphere-slo) at Makalu/infra/grafana/dashboards/slo.json is the single-pane view for "are we meeting our service objectives?".
Four stat panels at the top:
Availability (24h): 1 - 5xx/total over a trailing day. Target ≥ 99.9%.
Error Rate (5xx, 5m): recent share of 5xx responses. Sustained > 1% is the regression signal.
Request Rate (RPS): total API throughput.
Indexer Chain Lag: litho_indexer_chain_lag_blocks. Sustained > 100 blocks is the indexing alarm.
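The arithmetic behind the stat panels is small enough to check by hand. The sketch below applies the thresholds listed above to a single snapshot; the type and function names are hypothetical, the sample numbers are invented, and treating zero traffic as 100% availability is an assumption. A single snapshot cannot tell whether a breach is sustained, so this is only the per-sample comparison.

```typescript
interface SloSnapshot {
  totalRequests24h: number;
  serverErrors24h: number; // 5xx responses in the same window
  errorRate5m: number;     // fraction of 5xx over the last 5 minutes, 0..1
  chainLagBlocks: number;  // value of litho_indexer_chain_lag_blocks
}

// Availability = 1 - 5xx/total. With no traffic there were no failures,
// so this sketch reports 1 (an assumption, not dashboard behaviour).
function availability(total: number, serverErrors: number): number {
  if (total <= 0) return 1;
  return 1 - serverErrors / total;
}

// Compares one snapshot against the targets named in the panel list.
function sloBreaches(snapshot: SloSnapshot): string[] {
  const breaches: string[] = [];
  if (availability(snapshot.totalRequests24h, snapshot.serverErrors24h) < 0.999) {
    breaches.push("availability below 99.9%");
  }
  if (snapshot.errorRate5m > 0.01) {
    breaches.push("5xx error rate above 1%");
  }
  if (snapshot.chainLagBlocks > 100) {
    breaches.push("indexer chain lag above 100 blocks");
  }
  return breaches;
}

// Invented numbers: 2,000 errors in 1,000,000 requests is 99.8% availability.
const snapshot: SloSnapshot = {
  totalRequests24h: 1_000_000,
  serverErrors24h: 2_000,
  errorRate5m: 0.004,
  chainLagBlocks: 12,
};
console.log(sloBreaches(snapshot));
```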
Two time-series panels:
Request Latency p50 / p95 / p99: histogram_quantile() over litho_api_http_request_duration_seconds_bucket.
Requests by Status Class: stacked 2xx/3xx/4xx/5xx rates so post-deploy regressions are visually obvious.
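It helps to remember that the latency percentiles are estimates interpolated from bucket counts, not exact measurements. The sketch below is a simplified re-implementation of the linear interpolation Prometheus documents for classic histograms, written for illustration only; it skips edge cases such as negative bucket bounds and native histograms, and the bucket boundaries in the example are invented rather than taken from http-metrics.ts.

```typescript
// One cumulative bucket: `count` observations were <= `le` seconds.
// Buckets must be sorted by `le`, and the last one must be le = Infinity.
interface Bucket {
  le: number;
  count: number;
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (Number.isNaN(q) || buckets.length < 2) return NaN;
  if (q < 0) return -Infinity;
  if (q > 1) return Infinity;

  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  // Rank of the requested observation, then the first bucket that contains it.
  const rank = q * total;
  const idx = buckets.findIndex((bucket) => bucket.count >= rank);

  // Landing in the +Inf bucket: report the highest finite upper bound.
  if (idx === buckets.length - 1) {
    return buckets[buckets.length - 2].le;
  }

  const upper = buckets[idx].le;
  const lower = idx === 0 ? 0 : buckets[idx - 1].le;
  const below = idx === 0 ? 0 : buckets[idx - 1].count;
  const inBucket = buckets[idx].count - below;
  if (inBucket === 0) return lower;

  // Assume observations are spread evenly inside the bucket.
  return lower + (upper - lower) * ((rank - below) / inBucket);
}

// Invented bucket layout and counts, for illustration only.
const latencyBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];

console.log(histogramQuantile(0.5, latencyBuckets), histogramQuantile(0.95, latencyBuckets));
```

The practical consequence is that a reported p99 can never exceed the highest finite bucket bound, so a flat line at that value means the buckets are too coarse for the tail, not that latency is stable.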
One table:
Build Info: joins litho_api_build_info + litho_indexer_build_info so the operator can see which git_sha each service is running. If they diverge, it's a partial deploy.
The underlying metrics ship from Makalu/api/src/lib/http-metrics.ts. Routes are labeled by Express's normalized req.route.path (e.g. /api/blocks/:height), not the raw URL, so cardinality stays bounded.
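The cardinality point can be made concrete with a small label function. This is a sketch of the idea, not the code in http-metrics.ts: it uses a minimal structural type instead of Express's Request, and both the use of baseUrl to include a router's mount path and the "unmatched" fallback label are assumptions.

```typescript
// The subset of an Express request this sketch relies on.
interface RequestLike {
  baseUrl?: string;           // mount path of the router that matched, e.g. "/api"
  route?: { path?: string };  // route pattern, e.g. "/blocks/:height"
}

// Returns the route pattern rather than the raw URL, so /api/blocks/1 and
// /api/blocks/2 share one label value. Requests that matched no route
// (404s, scanners) collapse into a single hypothetical "unmatched" label.
function routeLabel(req: RequestLike): string {
  const pattern = req.route?.path;
  if (typeof pattern !== "string") {
    return "unmatched";
  }
  return `${req.baseUrl ?? ""}${pattern}`;
}

console.log(routeLabel({ baseUrl: "/api", route: { path: "/blocks/:height" } }));
console.log(routeLabel({}));
```

Labeling by raw URL would create one time series per block height, which is the unbounded growth the normalized path avoids.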
Cost dashboard
Grafana panel "Lithosphere — Cost" (uid lithosphere-cost) at Makalu/infra/grafana/dashboards/cost.json gives operators a single-pane view of "where is my VPS spend going?".
Top row — four stat panels for a 24h spend snapshot:
Monthly Budget: operator-set via the monthlyBudgetUsd template variable (default $60 ≈ t3.large on-demand). Adjust via the Grafana UI or by editing the dashboard JSON.
Host CPU (5m): fraction of cores busy. Sustained > 80% means scale-up is worth pricing.
Host Memory (used): fraction of physical RAM. Sustained > 85% means swap is imminent.
Disk Free: bytes available on /. The early-warning threshold turns yellow at 5 GB.
Middle rows — time series for the underlying drivers:
Network Egress (bytes/s) — host network out; the dominant variable cost above the instance fee on EC2.
Per-Container CPU: sum across containers reconciles to the Host CPU stat. Identifies which litho-* container drives the bill.
Per-Container Memory: same idea for RAM.
API Egress per 1k Requests (24h): bytes_out / api_requests * 1000. Multiply by your provider's per-GB bandwidth price for a marginal cost-per-1k-requests figure useful for traffic-growth forecasting.
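The egress figure and its cost conversion work out as follows. This is a worked sketch with invented inputs: the function names are hypothetical, the traffic numbers and the $0.09 per GB price are placeholders, and a GB is taken as 10^9 bytes, which you should check against how your provider bills.

```typescript
const BYTES_PER_GB = 1e9; // decimal gigabyte; an assumption about provider billing

// bytes_out / api_requests * 1000, guarded against a window with no requests.
function egressBytesPer1kRequests(bytesOut: number, apiRequests: number): number {
  if (apiRequests <= 0) return 0;
  return (bytesOut / apiRequests) * 1000;
}

// Marginal bandwidth cost of serving 1,000 requests at a given per-GB price.
function costPer1kRequestsUsd(bytesOut: number, apiRequests: number, pricePerGbUsd: number): number {
  return (egressBytesPer1kRequests(bytesOut, apiRequests) / BYTES_PER_GB) * pricePerGbUsd;
}

// Invented 24h figures: 50 GB sent while serving 10 million requests.
const bytesOut24h = 50e9;
const requests24h = 10_000_000;
const pricePerGbUsd = 0.09; // placeholder price

const bytesPer1k = egressBytesPer1kRequests(bytesOut24h, requests24h);
const usdPer1k = costPer1kRequestsUsd(bytesOut24h, requests24h, pricePerGbUsd);
console.log(`${bytesPer1k} bytes per 1k requests, about $${usdPer1k.toFixed(5)} per 1k requests`);
```

Multiplying the per-1k figure by a forecast request volume gives the bandwidth line of a traffic-growth estimate, separate from the fixed instance fee.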
Bottom row — Resource Footprint Summary table, one row per container with CPU cores / working-set memory / network out columns. The row driving any of the host-level numbers above is your scale-down candidate (or scale-up justification).
Glossary
| Identifier | Where it appears | What it ties together |
| --- | --- | --- |
| gitSha | /version, litho_*_build_info, OCI labels, span attributes | Image → commit |
| requestId | api logs, response X-Request-Id header, outgoing X-Request-Id | Single HTTP request → all work it triggered |
| cycleId | indexer logs, indexer outgoing X-Request-Id | One polling iteration → all logs from it |
| trace_id | OpenTelemetry spans | Distributed work across services (requires collector) |
Out of scope
Collector deployment. SDK is wired, env-gated. Standing up a receiver (Jaeger / Tempo / Grafana Cloud) is a separate infra ticket — likely validator-team-coordinated since it runs alongside Prometheus / Loki on the VPS.
SLO / cost dashboards. Analysis layers on top of metrics we already collect; deferred.
Migrating every console.* to logger. Done for the high-traffic paths; the long tail is mechanical and incremental.
Explorer (Next.js) instrumentation. Needs Next's instrumentation.ts pattern, not the Node SDK pattern. Server-side spans there are a separate body of work; client-side /version env propagation IS in place so the UI can display the deployed build.
Last updated