Phase 9 — Observability & Correlation

Status: ~90% (2026-05-11). Build metadata, structured logging, request-ID correlation, and the OpenTelemetry SDK skeleton all shipped. Collector deployment and the SLO / cost dashboards are deferred.

What this phase covers

The work plan's Phase 9 acceptance criterion: an operator can pull on any single thread — a Prometheus alert, a log line, a slow request — and trace it back to the commit that produced the running binary.

Coming into this phase, that wasn't possible. /health returned a hardcoded version: '1.0.0'. No container's process.env had any link back to git. Every log was unstructured console.log output with no per-request context. There were zero OpenTelemetry packages anywhere. An incident responder traced "production is broken" → "which commit?" by SSHing to the bastion and running git rev-parse HEAD, a 2–3 step manual correlation.

Leaving the phase: three discoverability layers (build metadata, request IDs, OTel spans) all wired end-to-end through CI → image → container → runtime. Loki ingests the new JSON logs natively; Prometheus has a build_info gauge to join metrics back to the deployed commit; the OTel SDK no-ops until a collector is deployed (zero production runtime cost today, zero service-code change required when one lands).

What we built

Build metadata propagation (commit 20f0233)

The chain from git push to a running container now carries GIT_SHA, BUILD_TIME, and VERSION at every step.

| Layer | What happens |
| --- | --- |
| publish-images.yaml | Computes build_time (RFC 3339) and version (tag if pushed, sha-<short> otherwise) once per run. Passes all three as --build-arg to api + indexer + explorer image builds. |
| Dockerfiles | ARG GIT_SHA / BUILD_TIME / VERSION in the runner stage; promoted to ENV so they're visible to the running process. Emits OCI labels (org.opencontainers.image.{revision,created,version,source,title}) so docker inspect and the GHCR UI surface the chain. |
| deploy-simple.yaml | Captures git rev-parse HEAD right after git clone --depth 1 (before the temp dir is removed) and writes GIT_SHA / BUILD_TIME / VERSION into the bastion's .env. Docker Compose threads the values through build: { args } so bastion-side rebuilds carry the same metadata. |
| docker-compose.yaml | Both build-time args: and runtime environment: pass the three vars to api, indexer, and explorer. |
| api / indexer | New /version endpoint returns {service, gitSha, buildTime, version, nodeVersion, uptimeSec}. /health now reads version from the env. New Prometheus gauges litho_api_build_info and litho_indexer_build_info (value = 1, labels carry the data; the standard Prometheus build-info pattern). |
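The version rule in the publish-images.yaml row (tag if pushed, sha-<short> otherwise) can be sketched as below. This is an illustration only: the real computation lives in the workflow, deriveVersion and buildArgFlags are hypothetical names, and the 7-character short SHA is an assumption.

```typescript
// Hypothetical helpers; the real computation lives in publish-images.yaml.

// Tag pushes arrive as "refs/tags/<tag>"; anything else falls back to sha-<short>.
// shortLen = 7 is an assumed short-SHA length.
function deriveVersion(gitRef: string, gitSha: string, shortLen = 7): string {
  const tagPrefix = "refs/tags/";
  if (gitRef.startsWith(tagPrefix)) {
    return gitRef.slice(tagPrefix.length);
  }
  return `sha-${gitSha.slice(0, shortLen)}`;
}

// Compute the three values once and render them as docker --build-arg flags.
function buildArgFlags(gitRef: string, gitSha: string, now: Date): string[] {
  const values: Array<[string, string]> = [
    ["GIT_SHA", gitSha],
    ["BUILD_TIME", now.toISOString()], // ISO 8601 UTC, which is valid RFC 3339
    ["VERSION", deriveVersion(gitRef, gitSha)],
  ];
  const flags: string[] = [];
  for (const [key, value] of values) {
    flags.push("--build-arg", `${key}=${value}`);
  }
  return flags;
}
```

Computing the values once and passing the same flags to every image build is what keeps api, indexer, and explorer in agreement about which commit they came from.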

Helpers live at Makalu/{api,indexer}/src/lib/build-info.ts (independent copies — the two packages ship as independent Docker images with no shared workspace dep).
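A minimal sketch of what such a build-info helper might look like, not the contents of the actual build-info.ts. The "unknown" fallback, the function names, and the build_time / version label names are assumptions. The gauge is rendered by hand in Prometheus text exposition format to keep the example dependency-free; the real helpers may well use a metrics library.

```typescript
type Env = Record<string, string | undefined>;

interface BuildInfo {
  service: string;
  gitSha: string;
  buildTime: string;
  version: string;
}

// Read the three build variables; the "unknown" fallback is an assumption.
function readBuildInfo(service: string, env: Env): BuildInfo {
  return {
    service,
    gitSha: env["GIT_SHA"] || "unknown",
    buildTime: env["BUILD_TIME"] || "unknown",
    version: env["VERSION"] || "unknown",
  };
}

// Shape of the /version response described in the table above.
function versionPayload(info: BuildInfo, nodeVersion: string, uptimeSec: number) {
  return { ...info, nodeVersion, uptimeSec: Math.floor(uptimeSec) };
}

// Render a build-info gauge: the value is always 1, the labels carry the data.
function buildInfoMetric(metricName: string, info: BuildInfo): string {
  const esc = (v: string) =>
    v.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
  const labels = [
    `git_sha="${esc(info.gitSha)}"`,
    `build_time="${esc(info.buildTime)}"`, // label name is an assumption
    `version="${esc(info.version)}"`, // label name is an assumption
  ].join(",");
  return `# TYPE ${metricName} gauge\n${metricName}{${labels}} 1\n`;
}
```

In a service, readBuildInfo would be called once at startup with process.env, and the same object would back /version, /health, and the gauge so the three can never disagree.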

Structured logging with request IDs (commit 8bf7fda)

Replaces console.log with pino JSON output and wires per-request correlation.

| Component | Purpose |
| --- | --- |
| Makalu/api/src/lib/logger.ts | Root pino logger + requestIdStore (AsyncLocalStorage) + resolveRequestId() + fetchWithRequestId() |
| Makalu/indexer/src/lib/logger.ts | Root pino logger + cycleIdStore + fetchWithCycleId() |
| Request-ID middleware | Reads X-Request-Id from the client or generates a UUID. Echoes it back as a response header. Stores it in ALS so any downstream await sees the same id. Logs a request_completed line on response finish with method, url, status, durationMs. |
| Pino mixin() | Every log line in a request/cycle scope automatically includes requestId / cycleId, with no arg-threading needed at call sites. |
| fetchWithRequestId | Wraps global fetch to inject the current requestId as X-Request-Id on outgoing calls. Caller can override with an explicit header. Falls through to plain fetch outside a request scope. |
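The correlation mechanics in the table above can be sketched with Node built-ins alone. This is an illustration under assumptions, not the contents of logger.ts: logRecord stands in for what a pino mixin would contribute, and outgoingHeaders shows only the header-merging rule of fetchWithRequestId; both names, and withRequestScope, are hypothetical.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

// One store per process; each request runs inside its own scope.
const requestIdStore = new AsyncLocalStorage<string>();

// Prefer the client's X-Request-Id; otherwise mint a UUID.
function resolveRequestId(headerValue: string | string[] | undefined): string {
  const candidate = Array.isArray(headerValue) ? headerValue[0] : headerValue;
  return candidate && candidate.trim() !== "" ? candidate.trim() : randomUUID();
}

// Stand-in for a logger mixin: merge the current id into every JSON record.
function logRecord(msg: string, fields: Record<string, unknown> = {}): string {
  const requestId = requestIdStore.getStore();
  return JSON.stringify({ ...(requestId ? { requestId } : {}), ...fields, msg });
}

// Headers for an outgoing call: inject the current id unless the caller
// already set one; outside a request scope the headers pass through untouched.
function outgoingHeaders(callerHeaders: Record<string, string> = {}): Record<string, string> {
  const requestId = requestIdStore.getStore();
  const hasExplicit = Object.keys(callerHeaders).some(
    (key) => key.toLowerCase() === "x-request-id",
  );
  if (!requestId || hasExplicit) return { ...callerHeaders };
  return { ...callerHeaders, "X-Request-Id": requestId };
}

// Run a handler inside a request scope; awaits inside it see the same id.
function withRequestScope<T>(headerValue: string | undefined, handler: () => T): T {
  return requestIdStore.run(resolveRequestId(headerValue), handler);
}
```

Because the id lives in AsyncLocalStorage rather than in function arguments, a helper three calls deep can log or make an outgoing request without knowing it is inside a request at all.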

JSON output is what Loki + Promtail (already running in Makalu/infra/) ingest natively, so flipping the format lights up full-text search in Grafana with zero infra changes.

Migration scope: highest-traffic logs (server startup, metrics_server_listening, request_completed, cycle_sync_range, cycle_failed, fatal) are switched. The long tail of helper-level console.log calls in routes.ts / litho.ts is intentionally deferred — it's mechanical work better done in small reviewable batches.

OpenTelemetry SDK (commit ebc449d)

Env-gated on OTEL_EXPORTER_OTLP_ENDPOINT. Wired in both api and indexer.

| Aspect | Behaviour |
| --- | --- |
| When env unset | SDK packages installed but never instantiated. Zero perf cost, no network calls, no transitive-package load on cold start (lazy-require keeps them off the import path). |
| When env set | NodeSDK starts with the standard auto-instrumentations pack (express, http, pg, fetch; fs and dns disabled as noise). Spans export via OTLP/HTTP to ${endpoint}/v1/traces. |
| Span resource attributes | service.name, service.version, litho.git_sha, litho.build_time, so a slow trace in Jaeger/Tempo identifies the build that produced it. |
| Lifecycle | Idempotent (startTracing() no-ops if already started). Graceful shutdown on SIGTERM / SIGINT flushes remaining spans. |
| Wiring | import './tracing.js' is the very first import in src/index.ts / src/mappings.ts. Must precede express / pg / fetch imports so auto-instrumentation can patch them. |

The point of shipping this dormant: the moment a collector lands (Jaeger, Tempo, or Grafana Cloud), no service-code change is required to start producing traces.
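The env-gating and idempotency described above can be sketched as follows. To stay self-contained the example does not import any OpenTelemetry package: TracingSdk and SdkFactory are hypothetical stand-ins for the SDK and its lazy loader, and stripping a trailing slash from the endpoint is a defensive assumption, not something the report specifies.

```typescript
type Env = Record<string, string | undefined>;

// Minimal surface this sketch needs from a tracing SDK (hypothetical interface).
interface TracingSdk {
  start(): void;
  shutdown(): Promise<void>;
}

// Hypothetical factory; in a real service this is where the SDK packages
// would be lazily loaded, so they stay off the import path when dormant.
type SdkFactory = (tracesUrl: string, resource: Record<string, string>) => TracingSdk;

let activeSdk: TracingSdk | undefined;

function startTracing(env: Env, serviceName: string, createSdk: SdkFactory): TracingSdk | undefined {
  if (activeSdk) return activeSdk; // idempotent: a second call is a no-op
  const endpoint = env["OTEL_EXPORTER_OTLP_ENDPOINT"];
  if (!endpoint) return undefined; // dormant: the factory is never invoked
  const tracesUrl = `${endpoint.replace(/\/+$/, "")}/v1/traces`;
  activeSdk = createSdk(tracesUrl, {
    "service.name": serviceName,
    "service.version": env["VERSION"] || "unknown",
    "litho.git_sha": env["GIT_SHA"] || "unknown",
    "litho.build_time": env["BUILD_TIME"] || "unknown",
  });
  activeSdk.start();
  return activeSdk;
}

// Flush and reset; a real service would call this from SIGTERM / SIGINT handlers.
async function stopTracing(): Promise<void> {
  const sdk = activeSdk;
  activeSdk = undefined;
  if (sdk) await sdk.shutdown();
}
```

The important property is that the unset path returns before the factory runs, which is what makes the dormant wiring free at runtime.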

Documentation

  • New runbook at docs/governance/observability.md: the three correlation layers, where to look first per incident type, worked examples for build-info / request-id / span lookups, a local-dev walkthrough with an OTel collector, and a glossary.

  • This phase completion report, following the existing docs/phases/ pattern.

How to use what was built

See which commit is running in production:
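A minimal TypeScript sketch of that lookup, assuming the api is reachable at a placeholder base URL. getVersion and FetchLike are hypothetical names; the fetch implementation is passed in so the function can be exercised with a stub.

```typescript
// Fields of the /version response listed earlier in this report.
interface VersionInfo {
  service: string;
  gitSha: string;
  buildTime: string;
  version: string;
  nodeVersion: string;
  uptimeSec: number;
}

// Minimal fetch-like signature; Node's global fetch satisfies it.
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}>;

async function getVersion(baseUrl: string, fetchImpl: FetchLike): Promise<VersionInfo> {
  const res = await fetchImpl(`${baseUrl.replace(/\/+$/, "")}/version`);
  if (!res.ok) {
    throw new Error(`GET /version failed with HTTP ${res.status}`);
  }
  return (await res.json()) as VersionInfo;
}

// Usage on Node 18+, which ships a global fetch (the host is a placeholder):
// getVersion("https://api.example.invalid", fetch)
//   .then((v) => console.log(v.service, v.gitSha, v.version));
```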

Find every log line for a single request:
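A hedged sketch of two ways to do this. logqlForRequest builds a LogQL query for Grafana, where the service stream label is an assumption about how Promtail labels these logs. filterByRequestId does the same filtering locally over newline-delimited JSON, for example on output captured from docker logs. Both function names are hypothetical.

```typescript
// Build a LogQL query that parses JSON log lines and keeps one request's records.
// The "service" stream label is an assumption; adjust to the labels Promtail sets.
function logqlForRequest(service: string, requestId: string): string {
  const esc = (v: string) => v.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  return `{service="${esc(service)}"} | json | requestId="${esc(requestId)}"`;
}

// Local equivalent: keep only the JSON lines whose requestId matches.
function filterByRequestId(ndjson: string, requestId: string): string[] {
  return ndjson.split("\n").filter((line) => {
    if (line.trim() === "") return false;
    try {
      const record = JSON.parse(line) as { requestId?: unknown };
      return record.requestId === requestId;
    } catch {
      return false; // not a JSON record, e.g. a leftover console.log line
    }
  });
}
```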

(Once the helper-level migrations land — currently only the high-traffic lifecycle logs carry requestId automatically. Adding more is a one-line change per call site: logger.info({ ... }, '...') instead of console.log.)

Compare deployed builds between services:
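One way to sketch the comparison in TypeScript, assuming the SHA each service reports has already been collected (for instance from each /version response, rather than from the gauge labels). groupBySha and isPartialDeploy are hypothetical names.

```typescript
// Group services by the SHA each one reports.
function groupBySha(reported: Record<string, string>): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const service of Object.keys(reported)) {
    const sha = reported[service] as string;
    const services = groups.get(sha) ?? [];
    services.push(service);
    groups.set(sha, services);
  }
  return groups;
}

// More than one distinct SHA means the services are not on the same build.
function isPartialDeploy(reported: Record<string, string>): boolean {
  return groupBySha(reported).size > 1;
}

// Usage with made-up values:
// isPartialDeploy({ api: "ebc449d", indexer: "8bf7fda" }) evaluates to true
```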

If git_sha differs across services you have a partial deploy.

Enable distributed tracing locally: start the service with OTEL_EXPORTER_OTLP_ENDPOINT set to a local collector's OTLP/HTTP endpoint.

Full walkthrough in observability.md § Running locally.

Why it matters

  • Incident response is now a single curl. "Which commit?" is answered in one HTTP call instead of an SSH-and-git-rev-parse dance. /version is on the same port the load balancer already exposes, so no firewall changes are needed.

  • Build-info gauges close the metrics-to-commit gap. Before this phase, a firing Prometheus alert carried no indication of which build produced it. After Phase 9, litho_*_build_info{} is a one-line join from a metric to the deployed sha.

  • JSON logs unlock Loki immediately. The Grafana stack already runs Loki and Promtail. Before this commit, log search was string-greps on unstructured console.log; after, it's structured queries on a JSON schema. Zero new infra.

  • Request IDs make user-reported bugs actionable. A user with a 500 page can include the X-Request-Id header (it's in their browser's network tab) and ops can grab the full log thread in one query.

  • OTel SDK is free insurance. Wired now, off by default. The day a collector lands, distributed tracing is one env var away. Doing it later means another full deploy cycle for what could have been a config flip.

  • Loose coupling beats lock-in. The SDK exports via standard OTLP/HTTP, meaning Jaeger / Tempo / Grafana Cloud / Datadog all accept it without code changes. We don't have to pick a vendor today to keep the option open.

Files & commits

| Path | What |
| --- | --- |
| .github/workflows/publish-images.yaml | edit (build-args for GIT_SHA / BUILD_TIME / VERSION) |
| .github/workflows/deploy-simple.yaml | edit (capture SHA at clone, inject env on bastion) |
| Makalu/api/Dockerfile | edit (ARG + ENV + OCI labels) |
| Makalu/indexer/Dockerfile | edit (same) |
| Makalu/explorer/Dockerfile | edit (same; explorer also gets /version env) |
| Makalu/docker-compose.yaml | edit (build: { args } + environment: passthrough, OTEL + LOG_LEVEL vars) |
| Makalu/api/package.json | edit (pino, pino-http, six @opentelemetry/* packages) |
| Makalu/indexer/package.json | edit (pino, six @opentelemetry/* packages) |
| Makalu/api/src/index.ts | edit (/version, request-id middleware, request_completed log, top-of-file tracing import) |
| Makalu/api/src/lib/build-info.ts | create |
| Makalu/api/src/lib/logger.ts | create |
| Makalu/api/src/tracing.ts | create |
| Makalu/api/src/__tests__/build-info.test.ts | create (5 tests) |
| Makalu/api/src/__tests__/logger.test.ts | create (12 tests) |
| Makalu/api/src/__tests__/tracing.test.ts | create (2 tests) |
| Makalu/indexer/src/mappings.ts | edit (/version, cycleId scope around poll loop, top-of-file tracing import) |
| Makalu/indexer/src/lib/build-info.ts | create |
| Makalu/indexer/src/lib/logger.ts | create |
| Makalu/indexer/src/tracing.ts | create |
| Makalu/indexer/src/__tests__/build-info.test.ts | create (3 tests) |
| docs/governance/observability.md | create |
| docs/phases/phase-9-completion.md | create |
| _sidebar.md | edit (surface observability runbook + this report) |

Commits: 20f0233 (build metadata), 8bf7fda (structured logging), ebc449d (OpenTelemetry SDK), + this docs commit. 22 new tests (api +19, indexer +3) — total now 122 api + 41 indexer.

Deferred work

  • OTel collector deployment. Wiring is dormant until a receiver (Jaeger / Tempo / Grafana Cloud / Datadog) is stood up. Likely validator-team coordinated — runs alongside Prometheus + Loki on the VPS or in a dedicated container. Separate ticket.

  • SLO + cost dashboards. The work-plan's Phase 9 lists both. They're analysis layers on top of metrics we already collect; deferred.

  • Helper-level console.log → logger migration. This phase ships enough call sites (lifecycle + cycle + startup) to prove the pattern and unblock Loki queries. The long tail (routes.ts, litho.ts, mappings.ts) is mechanical and best done in small batches.

  • Explorer (Next.js) instrumentation. Server-side tracing in Next uses an instrumentation.ts register pattern, not the NodeSDK pattern. Out of scope for this phase. /version env vars ARE propagated (so the UI can display the deployed build), but no span emission.

  • Promoting OTel from advisory to production-required. Once a collector is deployed and we have signal on what cardinality / cost looks like in steady state, we can flip from "env-gated, off in prod" to "always on, sampling configured."
