Lithosphere provides a comprehensive observability stack for monitoring validator nodes, infrastructure health, and application performance. The stack is built on industry-standard open-source tools and covers metrics collection, log aggregation, alerting, dashboards, and distributed tracing.
Prometheus
Prometheus is the core metrics collection and storage engine. It scrapes metrics from all Lithosphere services at configured intervals.
Configuration
The Prometheus configuration is defined in prometheus.yml. Scrape targets include:
Host-level system metrics (CPU, memory, disk, network)
cAdvisor (port 8080) -- Container-level resource usage and performance metrics
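A minimal prometheus.yml sketch covering these targets. Job names, the 15s intervals, and the host-metrics exporter address/port are assumptions for illustration, not confirmed Lithosphere defaults; only cAdvisor's port 8080 comes from this document:

```yaml
global:
  scrape_interval: 15s          # assumed default; tune per target
  evaluation_interval: 15s

rule_files:
  - lithosphere-alerts.yml      # alert rules file referenced below

scrape_configs:
  - job_name: cadvisor          # container-level resource metrics
    static_configs:
      - targets: ['cadvisor:8080']
  - job_name: host              # host-level system metrics (hypothetical job name)
    static_configs:
      - targets: ['host-metrics:9100']   # assumed exporter address and port
```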
Alert Rules
Alert rules are defined in lithosphere-alerts.yml and evaluated by Prometheus. Alerts fire when specified conditions are met over a defined duration.
Alert Categories
Service Health
Service instance down for more than 1 minute
API endpoint returning errors above threshold
Indexer falling behind chain head
Performance
Request latency exceeding acceptable thresholds
Transaction processing time degradation
Block sync rate dropping below expected levels
Resources
CPU usage sustained above threshold
Memory usage exceeding configured limits
Disk space running low on data volumes
Container Issues
Container restart loops detected
Container OOM (out-of-memory) kills
Unhealthy container health checks
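As an illustration, the "service instance down for more than 1 minute" condition from the Service Health category could be expressed as a rule along these lines (the alert name, group name, and labels are hypothetical):

```yaml
groups:
  - name: service-health
    rules:
      - alert: ServiceDown
        expr: up == 0            # target failed its most recent scrape
        for: 1m                  # condition must hold for 1 minute before firing
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} instance {{ $labels.instance }} is down"
```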
Grafana Dashboards
Grafana provides the visualization layer for all collected metrics. Pre-configured dashboards are available out of the box.
Pre-configured Dashboards
System Overview -- Host-level metrics including CPU, memory, disk I/O, and network throughput
API Monitoring -- Request rates, latency percentiles, error rates, and endpoint-level breakdowns
Container Metrics -- Per-container CPU, memory, network, and disk usage from cAdvisor
Auto-provisioning
Dashboards and data sources are automatically provisioned from the grafana/provisioning/ directory. No manual configuration is required on first startup.
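Provisioning is driven by YAML provider files; a sketch of what a dashboard provider under grafana/provisioning/dashboards/ typically looks like (the provider name, folder, and container path here are illustrative):

```yaml
apiVersion: 1
providers:
  - name: lithosphere            # illustrative provider name
    folder: Lithosphere          # Grafana folder the dashboards appear in
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards   # directory scanned for dashboard JSON
```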
Access
URL: http://localhost:3000
Default credentials: admin / lithosphere
Important: Change the default password immediately after first login in any non-local environment.
Adding Custom Dashboards
To add a custom dashboard:
Create the dashboard in the Grafana UI.
Export it as JSON via Share > Export > Save to file.
Place the JSON file in grafana/provisioning/dashboards/.
The dashboard will be automatically loaded on next restart.
Loki -- Log Aggregation
Loki provides centralized log aggregation for all Lithosphere services.
Configuration
Retention -- 30 days
Ingestion rate limit -- 4 MB/s
Storage backend -- Filesystem
Loki stores logs on the local filesystem by default. For production deployments, consider configuring an object storage backend (S3, GCS) for improved durability and scalability.
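A Loki configuration sketch matching the values above. The section layout follows recent Loki versions and the chunk directory is an assumption; treat it as a starting point rather than the shipped file:

```yaml
limits_config:
  retention_period: 720h         # 30 days
  ingestion_rate_mb: 4           # 4 MB/s ingestion limit
compactor:
  retention_enabled: true        # enforce retention_period via the compactor
storage_config:
  filesystem:
    directory: /loki/chunks      # local filesystem backend (assumed path)
```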
Promtail -- Log Collection
Promtail is the log collection agent that ships logs to Loki.
Sources
Docker container logs -- Automatically collected from all running containers.
System logs -- Host-level system logs (syslog, journald).
Features
Auto-discovery -- Automatically discovers new containers and begins collecting their logs without manual configuration.
JSON parsing pipeline -- Parses structured JSON log output to extract labels and fields for efficient querying in Loki.
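The auto-discovery and JSON parsing described above map to a Promtail scrape config along these lines (the job name and the field extracted in the json stage are examples):

```yaml
scrape_configs:
  - job_name: docker
    docker_sd_configs:           # auto-discover running containers
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    pipeline_stages:
      - json:                    # parse structured JSON log lines
          expressions:
            level: level         # example field pulled from the JSON body
      - labels:
          level:                 # promote the parsed field to a Loki label
```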
AlertManager -- Routing and Notifications
AlertManager receives alerts from Prometheus and routes them to the appropriate notification channels based on configurable rules.
Notification Channels
Slack Integration:
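A Slack receiver in the Alertmanager configuration generally takes this shape (the webhook URL, receiver name, and channel are placeholders):

```yaml
receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ   # placeholder webhook
        channel: '#alerts'
        send_resolved: true      # also notify when the alert clears
```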
Email Integration:
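An email receiver follows the same pattern (SMTP host, addresses, and credentials are placeholders):

```yaml
receivers:
  - name: email-notifications
    email_configs:
      - to: oncall@example.com
        from: alertmanager@example.com
        smarthost: smtp.example.com:587     # placeholder SMTP relay
        auth_username: alertmanager@example.com
        auth_password: changeme             # placeholder; use a secret in practice
```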
Routing
Configure routing rules to direct alerts to specific channels based on severity, service, or other labels:
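A route tree that splits on the severity label might look like this (all receiver names here are placeholders):

```yaml
route:
  receiver: default              # fallback receiver (placeholder name)
  group_by: [alertname, service]
  routes:
    - matchers:
        - severity="critical"    # page on critical alerts
      receiver: pager
    - matchers:
        - severity="warning"     # lower-urgency channel
      receiver: chat
```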
Performance Tuning
Reduce Disk Usage
Lower Prometheus retention period (default: 15 days). Set --storage.tsdb.retention.time=7d for shorter retention.
Increase Loki chunk target size to reduce the number of stored chunks.
Enable Prometheus WAL compression with --storage.tsdb.wal-compression.
Reduce Memory Consumption
Set container memory limits in your Docker Compose or Kubernetes manifests.
Reduce Prometheus --storage.tsdb.max-block-duration to limit in-memory block sizes.
Limit Loki ingester memory with the ingester settings max_chunk_age and chunk_idle_period.
Optimize Scrape Intervals
Increase the scrape interval for less critical targets (e.g., system metrics can use 30s or 60s intervals).
Use relabeling rules to drop high-cardinality metrics that are not needed.
Avoid scraping the same target from multiple Prometheus instances unless high availability is required.
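Dropping unneeded high-cardinality series is done with metric relabeling on the scrape job before samples are stored; a sketch (the metric name pattern is just an example):

```yaml
scrape_configs:
  - job_name: cadvisor
    static_configs:
      - targets: ['cadvisor:8080']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'container_tasks_state.*'   # example high-cardinality metric family
        action: drop                       # discard before storage
```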
Adding Custom Alerts
To add a custom alert rule:
Edit lithosphere-alerts.yml (or create a new rules file).
Define the alert using PromQL:
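A rule file entry for this step might look like the following (the metric, threshold, and alert name are hypothetical):

```yaml
groups:
  - name: custom
    rules:
      - alert: HighErrorRate     # hypothetical alert
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m                 # must be sustained for 10 minutes
        labels:
          severity: warning
        annotations:
          summary: "5xx error rate above 5%"
```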
Reload the Prometheus configuration (send SIGHUP to the process, or POST to the /-/reload endpoint, which requires the --web.enable-lifecycle flag).
OpenTelemetry Tracing
Lithosphere supports distributed tracing via OpenTelemetry for end-to-end request visibility across services.
Components
OTel JS SDK -- Instruments application code to generate trace spans
OTel Collector -- Receives, processes, and exports trace data
Grafana Tempo -- Trace storage and query backend
How It Works
The OTel JS SDK is integrated into Lithosphere services to automatically instrument HTTP requests, database calls, and custom operations.
Traces are sent to the OTel Collector, which processes them (batching, sampling, enrichment) and exports them to the storage backend.
Grafana Tempo stores traces and makes them queryable through the Grafana UI.
In Grafana, traces can be correlated with logs (via Loki) and metrics (via Prometheus) using trace IDs for full-stack observability.
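The Collector's role in the pipeline above corresponds to a configuration along these lines (the Tempo endpoint address is an assumption):

```yaml
receivers:
  otlp:
    protocols:
      grpc:                      # traces arrive from the OTel JS SDK
      http:
processors:
  batch:                         # batch spans before export
exporters:
  otlp:
    endpoint: tempo:4317         # assumed Tempo OTLP endpoint
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```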