Phase 9 — Observability & Correlation
Status: ~90% (2026-05-11). Build metadata, structured logging, request-ID correlation, and the OpenTelemetry SDK skeleton all shipped. Collector deployment and the SLO / cost dashboards are deferred.
What this phase covers
The work plan's Phase 9 acceptance criterion: an operator can pull on any single thread — a Prometheus alert, a log line, a slow request — and trace it back to the commit that produced the running binary.
Coming into this phase, that wasn't possible. /health returned a hardcoded version: '1.0.0'. No container's process.env had a link back to git. Every log was unstructured console.log output with no per-request context. There were zero OpenTelemetry packages anywhere. An incident responder traced "production is broken" → "which commit?" by SSHing to the bastion and running git rev-parse HEAD, a 2–3 step manual correlation.
Leaving the phase: three discoverability layers (build metadata, request IDs, OTel spans) all wired end-to-end through CI → image → container → runtime. Loki ingests the new JSON logs natively; Prometheus has a build_info gauge to join metrics back to the deployed commit; the OTel SDK no-ops until a collector is deployed (zero production runtime cost today, zero service-code change required when one lands).
What we built
Build metadata propagation (commit 20f0233)
The chain from git push to a running container now carries GIT_SHA, BUILD_TIME, and VERSION at every step.
| Component | What changed |
| --- | --- |
| publish-images.yaml | Computes build_time (RFC 3339) and version (tag if pushed, sha-<short> otherwise) once per run. Passes all three as --build-arg to api + indexer + explorer image builds. |
| Dockerfiles | ARG GIT_SHA / BUILD_TIME / VERSION in the runner stage; promoted to ENV so they're visible to the running process. Emits OCI labels (org.opencontainers.image.{revision,created,version,source,title}) so docker inspect and the GHCR UI surface the chain. |
| deploy-simple.yaml | Captures git rev-parse HEAD right after git clone --depth 1 (before the temp dir is removed) and writes GIT_SHA / BUILD_TIME / VERSION into the bastion's .env. Docker Compose threads the values through build: { args } so bastion-side rebuilds carry the same metadata. |
| docker-compose.yaml | Both build-time args: and runtime environment: pass the three vars to api, indexer, and explorer. |
| api / indexer | New /version endpoint returns {service, gitSha, buildTime, version, nodeVersion, uptimeSec}. /health now reads version from the env. New Prometheus gauges litho_api_build_info and litho_indexer_build_info (value = 1, labels carry the data; this is the standard Prometheus build-info pattern). |
Helpers live at Makalu/{api,indexer}/src/lib/build-info.ts (independent copies — the two packages ship as independent Docker images with no shared workspace dep).
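As a rough illustration of the chain above, the sketch below shows the version rule, reading the three variables from the environment, and rendering a build-info gauge line. It is not the contents of build-info.ts: the function names, the 7-character short SHA, the "unknown" fallback, and the build_time and version label names are all assumptions made for the example.

```typescript
// Sketch only: resolveVersion, readBuildInfo and buildInfoMetricLine are
// illustrative names, not the helpers that ship in build-info.ts.

type Env = Record<string, string | undefined>;

interface BuildInfo {
  gitSha: string;
  buildTime: string;
  version: string;
}

// Version rule: the tag when a tag was pushed, otherwise "sha-<short>".
// The 7-character short SHA length is an assumption.
function resolveVersion(gitRef: string, gitSha: string): string {
  const tagPrefix = "refs/tags/";
  if (gitRef.startsWith(tagPrefix)) {
    return gitRef.slice(tagPrefix.length);
  }
  return `sha-${gitSha.slice(0, 7)}`;
}

// Read the three variables the Dockerfile promotes from ARG to ENV.
// Falling back to "unknown" for unset values is an assumption.
function readBuildInfo(env: Env): BuildInfo {
  return {
    gitSha: env["GIT_SHA"] ?? "unknown",
    buildTime: env["BUILD_TIME"] ?? "unknown",
    version: env["VERSION"] ?? "unknown",
  };
}

// Render one sample in Prometheus text exposition format: the value is
// always 1 and the labels carry the data.
function buildInfoMetricLine(metricName: string, info: BuildInfo): string {
  const escapeLabel = (v: string) =>
    v.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
  const labels = [
    `git_sha="${escapeLabel(info.gitSha)}"`,
    `build_time="${escapeLabel(info.buildTime)}"`,
    `version="${escapeLabel(info.version)}"`,
  ].join(",");
  return `${metricName}{${labels}} 1`;
}
```

Taking the environment as a parameter rather than reading process.env directly keeps the helper easy to unit test, which may be one reason to keep such helpers small and duplicated per package.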
Structured logging with request IDs (commit 8bf7fda)
Replaces console.log with pino JSON output and wires per-request correlation.
| Piece | What it does |
| --- | --- |
| Makalu/api/src/lib/logger.ts | Root pino logger + requestIdStore (AsyncLocalStorage) + resolveRequestId() + fetchWithRequestId() |
| Makalu/indexer/src/lib/logger.ts | Root pino logger + cycleIdStore + fetchWithCycleId() |
| Request-ID middleware | Reads X-Request-Id from the client or generates a UUID. Echoes it back as a response header. Stores it in ALS so any downstream await sees the same id. Logs a request_completed line on response finish with method, url, status, durationMs. |
| Pino mixin() | Every log line in a request/cycle scope automatically includes requestId / cycleId, so no arg-threading is needed at call sites. |
| fetchWithRequestId | Wraps global fetch to inject the current requestId as X-Request-Id on outgoing calls. Caller can override with an explicit header. Falls through to plain fetch outside a request scope. |
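The correlation mechanism described above can be sketched with Node's AsyncLocalStorage alone. The names requestIdStore and resolveRequestId come from this report, but the bodies below are assumptions; logMixin, withRequestIdHeader and runWithRequestId are hypothetical helpers that stand in for the pino mixin, the fetch wrapper's header logic, and the middleware respectively.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

// Sketch only: not the contents of Makalu/api/src/lib/logger.ts.

interface RequestContext {
  requestId: string;
}

const requestIdStore = new AsyncLocalStorage<RequestContext>();

// Prefer the caller's X-Request-Id; otherwise mint a UUID.
function resolveRequestId(headerValue: string | undefined): string {
  const trimmed = headerValue?.trim();
  return trimmed ? trimmed : randomUUID();
}

// Shape of what a pino `mixin` function would return: extra fields merged
// into every log line. Outside a request scope it contributes nothing.
function logMixin(): Record<string, string> {
  const ctx = requestIdStore.getStore();
  return ctx ? { requestId: ctx.requestId } : {};
}

// Header logic for a fetchWithRequestId-style wrapper: inject the current
// id unless the caller already set X-Request-Id explicitly.
function withRequestIdHeader(
  headers: Record<string, string>,
): Record<string, string> {
  const ctx = requestIdStore.getStore();
  const alreadySet = Object.keys(headers).some(
    (name) => name.toLowerCase() === "x-request-id",
  );
  if (!ctx || alreadySet) {
    return { ...headers };
  }
  return { ...headers, "X-Request-Id": ctx.requestId };
}

// Middleware-style entry point: everything run inside `handler`, including
// code that resumes after an await, sees the same id through the store.
function runWithRequestId<T>(
  incomingHeader: string | undefined,
  handler: () => T,
): T {
  const requestId = resolveRequestId(incomingHeader);
  return requestIdStore.run({ requestId }, handler);
}
```

Because the id lives in the store rather than in function arguments, call sites deep in the stack need no signature changes to log or forward it.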
JSON output is what Loki + Promtail (already running in Makalu/infra/) ingest natively, so flipping the format lights up full-text search in Grafana with zero infra changes.
Migration scope: highest-traffic logs (server startup, metrics_server_listening, request_completed, cycle_sync_range, cycle_failed, fatal) are switched. The long tail of helper-level console.log calls in routes.ts / litho.ts is intentionally deferred — it's mechanical work better done in small reviewable batches.
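To make the migration concrete, here is a dependency-free stand-in that shows the shape of a migrated call site's output. formatLogLine is hypothetical and is not pino; the numeric levels and the level, time and msg field names follow pino's defaults, while fromBlock, toBlock and the cycleId value are invented for the example.

```typescript
// Sketch only: a stand-in for the pino logger so the example runs without
// dependencies. formatLogLine is a hypothetical helper.

type Fields = Record<string, unknown>;

// pino's default numeric levels for these three names.
const LEVELS = { info: 30, warn: 40, error: 50 } as const;

function formatLogLine(
  level: keyof typeof LEVELS,
  fields: Fields,
  msg: string,
  scope: Fields = {}, // what the mixin would contribute, e.g. { cycleId }
  now: number = Date.now(),
): string {
  return JSON.stringify({ level: LEVELS[level], time: now, ...scope, ...fields, msg });
}

// Before: console.log(`synced ${from}-${to}`)
// After: structured fields plus a stable event name as the message.
const line = formatLogLine(
  "info",
  { fromBlock: 100, toBlock: 200 },
  "cycle_sync_range",
  { cycleId: "c-1" },
  0,
);
```

Using a fixed event name as the message and moving the variable parts into fields is what makes the line queryable by field once Loki parses it as JSON.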
OpenTelemetry SDK (commit ebc449d)
Env-gated on OTEL_EXPORTER_OTLP_ENDPOINT. Wired in both api and indexer.
| Aspect | Behavior |
| --- | --- |
| When env unset | SDK packages installed but never instantiated. Zero perf cost, no network calls, no transitive-package load on cold start (lazy-require keeps them off the import path). |
| When env set | NodeSDK starts with the standard auto-instrumentations pack (express, http, pg, fetch; fs and dns disabled as noise). Spans export via OTLP/HTTP to ${endpoint}/v1/traces. |
| Span resource attributes | service.name, service.version, litho.git_sha, litho.build_time, so a slow trace in Jaeger/Tempo identifies the build that produced it. |
| Lifecycle | Idempotent (startTracing() no-ops if already started). Graceful shutdown on SIGTERM / SIGINT flushes remaining spans. |
| Wiring | import './tracing.js' is the very first import in src/index.ts / src/mappings.ts. Must precede express / pg / fetch imports so auto-instrumentation can patch them. |
The point of shipping this dormant: the moment a collector lands (Jaeger, Tempo, or Grafana Cloud), no service-code change is required to start producing traces.
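The gating and lifecycle rules can be sketched without the OTel packages by injecting the SDK as a factory. Only startTracing, the environment variable, the /v1/traces path and the four resource attribute keys come from this report; TracerHandle, TracingConfig, createSdk, stopTracing, the "unknown" fallback and the trailing-slash trimming are assumptions. In the real module the factory's role would be played by the lazily required NodeSDK.

```typescript
// Sketch only: TracerHandle and the createSdk factory are hypothetical
// stand-ins so the gating logic can run without @opentelemetry/* installed.

type Env = Record<string, string | undefined>;

interface TracerHandle {
  start(): void;
  shutdown(): Promise<void>;
}

interface TracingConfig {
  tracesUrl: string;
  resourceAttributes: Record<string, string>;
}

let active: TracerHandle | null = null;

// Derive the exporter URL and resource attributes from the environment.
// Returns null when the endpoint is unset, which keeps tracing dormant.
function tracingConfigFromEnv(env: Env, serviceName: string): TracingConfig | null {
  const endpoint = env["OTEL_EXPORTER_OTLP_ENDPOINT"];
  if (!endpoint) return null;
  return {
    // Trimming trailing slashes is an assumption, to avoid "//v1/traces".
    tracesUrl: `${endpoint.replace(/\/+$/, "")}/v1/traces`,
    resourceAttributes: {
      "service.name": serviceName,
      "service.version": env["VERSION"] ?? "unknown",
      "litho.git_sha": env["GIT_SHA"] ?? "unknown",
      "litho.build_time": env["BUILD_TIME"] ?? "unknown",
    },
  };
}

// Idempotent start: returns true only on the call that starts the SDK.
function startTracing(
  env: Env,
  serviceName: string,
  createSdk: (config: TracingConfig) => TracerHandle,
): boolean {
  if (active) return false; // already started
  const config = tracingConfigFromEnv(env, serviceName);
  if (!config) return false; // env unset: the factory is never invoked
  active = createSdk(config);
  active.start();
  return true;
}

// Flush remaining spans; a real process would call this from its
// SIGTERM / SIGINT handlers.
async function stopTracing(): Promise<void> {
  const handle = active;
  active = null;
  if (handle) await handle.shutdown();
}
```

Keeping the factory call behind the environment check is what preserves the "never instantiated" property when no endpoint is configured.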
Documentation
New runbook at docs/governance/observability.md: the three correlation layers, where to look first per incident type, worked examples for build-info / request-id / span lookups, a local-dev walkthrough with an OTel collector, and a glossary.
This phase completion report, following the existing docs/phases/ pattern.
How to use what was built
See which commit is running in production:
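A minimal TypeScript sketch of that lookup, assuming Node 18+ for the global fetch. Only the /version path and the response field names come from this report; the base URL passed to fetchVersion is whatever host the service is exposed on and is not specified here, and describeVersion is a hypothetical formatting helper.

```typescript
// Sketch only: field names follow the /version description in this report.

interface VersionInfo {
  service: string;
  gitSha: string;
  buildTime: string;
  version: string;
  nodeVersion: string;
  uptimeSec: number;
}

// One-line summary suitable for pasting into an incident channel.
function describeVersion(info: VersionInfo): string {
  return `${info.service} ${info.version} @ ${info.gitSha} (built ${info.buildTime}, up ${info.uptimeSec}s)`;
}

// Fetch /version from a running service.
async function fetchVersion(baseUrl: string): Promise<VersionInfo> {
  const response = await fetch(new URL("/version", baseUrl).toString());
  if (!response.ok) {
    throw new Error(`GET /version failed with HTTP ${response.status}`);
  }
  return (await response.json()) as VersionInfo;
}
```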
Find every log line for a single request:
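One way to express that lookup, sketched in TypeScript: a helper that builds a LogQL query for Grafana, plus the same filter applied locally to newline-delimited JSON. The service stream label is an assumption about how Promtail labels these containers, and both function names are hypothetical; requestId is the field the mixin adds.

```typescript
// Sketch only: the `service` stream label is an assumed Promtail label.

// Build a LogQL query: select a stream, parse each line as JSON, then keep
// lines whose extracted requestId matches.
function logqlForRequest(service: string, requestId: string): string {
  const quote = (v: string) =>
    `"${v.replace(/\\/g, "\\\\").replace(/"/g, '\\"')}"`;
  return `{service=${quote(service)}} | json | requestId=${quote(requestId)}`;
}

// The same filter applied locally to newline-delimited JSON, for example
// output captured from a container's stdout.
function linesForRequest(ndjson: string, requestId: string): string[] {
  return ndjson.split("\n").filter((line) => {
    if (line.trim() === "") return false;
    try {
      const parsed: unknown = JSON.parse(line);
      return (
        typeof parsed === "object" &&
        parsed !== null &&
        (parsed as Record<string, unknown>)["requestId"] === requestId
      );
    } catch {
      return false; // not JSON, e.g. a not-yet-migrated console.log line
    }
  });
}
```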
(Once the helper-level migrations land — currently only the high-traffic lifecycle logs carry requestId automatically. Adding more is a one-line change per call site: logger.info({ ... }, '...') instead of console.log.)
Compare deployed builds between services:
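The comparison itself is simple enough to sketch as a pure function. The input shape and the names DeployCheck and compareBuilds are hypothetical; in practice the SHAs could come from each service's /version endpoint or from the build-info gauges.

```typescript
// Sketch only: the input is a hypothetical map of service name to git SHA.

interface DeployCheck {
  consistent: boolean;
  bySha: Record<string, string[]>; // git SHA -> services running it
}

function compareBuilds(gitShaByService: Record<string, string>): DeployCheck {
  const bySha: Record<string, string[]> = {};
  for (const [service, sha] of Object.entries(gitShaByService)) {
    const services = bySha[sha] ?? [];
    services.push(service);
    bySha[sha] = services;
  }
  // More than one distinct SHA means the services are on different builds.
  return { consistent: Object.keys(bySha).length <= 1, bySha };
}
```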
If git_sha differs across services you have a partial deploy.
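Enable distributed tracing locally: set OTEL_EXPORTER_OTLP_ENDPOINT to a local collector's OTLP/HTTP address before starting the api or indexer.
Full walkthrough in observability.md § Running locally.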
Why it matters
Incident response is now a single curl. "Which commit?" is answered in one HTTP call instead of an SSH-and-git log dance. /version is on the same port the load balancer already exposes, so no firewall changes are needed.
Build-info gauges close the metrics-to-commit gap. Before this phase, a firing Prometheus alert carried no indication of which build produced it. Post-Phase-9, litho_*_build_info{} is a one-line join from a metric to the deployed sha.
JSON logs unlock Loki immediately. The Grafana stack already runs Loki and Promtail. Before this commit, log search was string-greps on unstructured console.log; after, it's structured queries on a JSON schema. Zero new infra.
Request IDs make user-reported bugs actionable. A user with a 500 page can include the X-Request-Id header (it's in their browser's network tab) and ops can grab the full log thread in one query.
OTel SDK is free insurance. Wired now, off by default. The day a collector lands, distributed tracing is one env var away. Doing it later means another full deploy cycle for what could have been a config flip.
Loose coupling beats lock-in. The SDK exports via standard OTLP/HTTP, meaning Jaeger / Tempo / Grafana Cloud / Datadog all accept it without code changes. We don't have to pick a vendor today to keep the option open.
Files & commits
| File | Change |
| --- | --- |
| .github/workflows/publish-images.yaml | edit (build-args for GIT_SHA / BUILD_TIME / VERSION) |
| .github/workflows/deploy-simple.yaml | edit (capture SHA at clone, inject env on bastion) |
| Makalu/api/Dockerfile | edit (ARG + ENV + OCI labels) |
| Makalu/indexer/Dockerfile | edit (same) |
| Makalu/explorer/Dockerfile | edit (same; explorer also gets /version env) |
| Makalu/docker-compose.yaml | edit (build: { args } + environment: passthrough, OTEL + LOG_LEVEL vars) |
| Makalu/api/package.json | edit (pino, pino-http, six @opentelemetry/* packages) |
| Makalu/indexer/package.json | edit (pino, six @opentelemetry/* packages) |
| Makalu/api/src/index.ts | edit (/version, request-id middleware, request_completed log, top-of-file tracing import) |
| Makalu/api/src/lib/build-info.ts | create |
| Makalu/api/src/lib/logger.ts | create |
| Makalu/api/src/tracing.ts | create |
| Makalu/api/src/__tests__/build-info.test.ts | create (5 tests) |
| Makalu/api/src/__tests__/logger.test.ts | create (12 tests) |
| Makalu/api/src/__tests__/tracing.test.ts | create (2 tests) |
| Makalu/indexer/src/mappings.ts | edit (/version, cycleId scope around poll loop, top-of-file tracing import) |
| Makalu/indexer/src/lib/build-info.ts | create |
| Makalu/indexer/src/lib/logger.ts | create |
| Makalu/indexer/src/tracing.ts | create |
| Makalu/indexer/src/__tests__/build-info.test.ts | create (3 tests) |
| docs/governance/observability.md | create |
| docs/phases/phase-9-completion.md | create |
| _sidebar.md | edit (surface observability runbook + this report) |
Commits: 20f0233 (build metadata), 8bf7fda (structured logging), ebc449d (OpenTelemetry SDK), + this docs commit. 22 new tests (api +19, indexer +3) — total now 122 api + 41 indexer.
Deferred work
OTel collector deployment. Wiring is dormant until a receiver (Jaeger / Tempo / Grafana Cloud / Datadog) is stood up. Likely validator-team coordinated — runs alongside Prometheus + Loki on the VPS or in a dedicated container. Separate ticket.
SLO + cost dashboards. The work-plan's Phase 9 lists both. They're analysis layers on top of metrics we already collect; deferred.
Helper-level console.log → logger migration. This phase ships enough call sites (lifecycle + cycle + startup) to prove the pattern and unblock Loki queries. The long tail (routes.ts, litho.ts, mappings.ts) is mechanical and best done in small batches.
Explorer (Next.js) instrumentation. Server-side tracing in Next uses an instrumentation.ts register pattern, not the NodeSDK pattern. Out of scope for this phase. The /version env vars ARE propagated (so the UI can display the deployed build), but no spans are emitted.
Promoting OTel from advisory to production-required. Once a collector is deployed and we have signal on what cardinality / cost looks like in steady state, we can flip from "env-gated, off in prod" to "always on, sampling configured."