17. Monitoring, Logging & Observability
Observability is the ability to understand the internal state of a system by examining its external outputs. In distributed systems, observability is essential for debugging, performance optimization, and ensuring reliability.
The Three Pillars of Observability
                Observability
               /      |      \
          Metrics    Logs    Traces
          (What)    (Why)    (Where)
1. Metrics
Numerical measurements collected over time. Used for dashboards, alerting, and trend analysis.
cpu_usage{host="server-1"} 0.75 # 75% CPU
http_requests_total{status="200"} 15234 # Request counter
request_duration_seconds{quantile="0.99"} 0.250 # 250ms p99 latency
Types of Metrics:
| Type | Description | Example |
|---|---|---|
| Counter | Monotonically increasing value | Total requests, total errors |
| Gauge | Value that goes up and down | CPU usage, memory usage, queue depth |
| Histogram | Distribution of values in buckets | Request latency distribution |
| Summary | Calculated percentiles | p50, p95, p99 latency |
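The four metric types can be sketched as minimal Python classes. This is an illustrative toy, not a real metrics client; class and method names mirror Prometheus conventions but are assumptions for this sketch:

```python
import bisect

class Counter:
    """Monotonically increasing value -- resets only on process restart."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount

class Gauge:
    """Value that can go up and down, e.g. current queue depth."""
    def __init__(self):
        self.value = 0
    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations into latency buckets (upper bounds in seconds)."""
    def __init__(self, buckets=(0.05, 0.1, 0.25, 0.5, 1.0)):
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # last slot = +Inf
        self.total = 0.0
    def observe(self, value):
        # bisect_left places a value equal to a bound into that bucket
        self.counts[bisect.bisect_left(self.buckets, value)] += 1
        self.total += value

requests = Counter()
requests.inc()
latency = Histogram()
latency.observe(0.23)  # lands in the 0.25s bucket
```

A Summary differs from a Histogram in that percentiles are precomputed client-side rather than derived from buckets at query time.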
Key Metrics to Monitor:
| Category | Metrics |
|---|---|
| Service | Request rate, error rate, latency (p50/p95/p99) |
| Infrastructure | CPU, memory, disk I/O, network I/O |
| Database | Query latency, connections, replication lag, cache hit rate |
| Queue | Queue depth, consumer lag, processing rate |
| Business | Orders/sec, revenue, active users, conversion rate |
RED Method (Request-focused)
For request-driven services (APIs, web servers):
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Latency distribution (p50, p95, p99)
USE Method (Resource-focused)
For infrastructure resources (CPU, memory, disk, network):
- Utilization: Percentage of resource in use
- Saturation: Queue length / overflow
- Errors: Error count
Four Golden Signals (Google SRE)
- Latency: Time to serve a request
- Traffic: Requests per second
- Errors: Rate of failed requests
- Saturation: How full the system is (CPU, memory, queue depth)
2. Logs
Timestamped, immutable text records of discrete events.
{
"timestamp": "2025-01-15T10:30:00.123Z",
"level": "ERROR",
"service": "payment-service",
"traceId": "abc123def456",
"spanId": "span-789",
"message": "Payment processing failed",
"userId": "user-456",
"orderId": "order-789",
"error": "CardDeclined",
"duration_ms": 1250
}
Log Levels:
| Level | Use |
|---|---|
| TRACE | Very detailed debugging (usually disabled in production) |
| DEBUG | Detailed information for debugging |
| INFO | Normal operation events |
| WARN | Something unexpected but not immediately harmful |
| ERROR | Something failed; requires attention |
| FATAL | System cannot continue |
Structured Logging Best Practices:
- Use JSON format for machine parsability.
- Include traceId for correlation with traces.
- Include requestId for request tracking.
- Include contextual fields (userId, orderId, etc.).
- Don't log sensitive data (passwords, tokens, PII).
- Use consistent field names across all services.
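These practices can be sketched with the standard-library logging module and a custom JSON formatter. The service name and context fields are hypothetical; real services typically use a dedicated structured-logging library:

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with consistent field names."""
    def format(self, record):
        entry = {
            "timestamp": time.strftime(
                "%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "payment-service",  # hypothetical service name
            "message": record.getMessage(),
        }
        # Contextual fields (traceId, orderId, ...) attached via `extra=`
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("Payment processing failed",
             extra={"context": {"traceId": "abc123", "error": "CardDeclined"}})
```

Passing context via extra rather than interpolating it into the message string keeps fields machine-parsable.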
Log Aggregation Pipeline:
Services → [Log Shipper (Fluentd/Filebeat)] → [Kafka / Buffer]
                                                     ↓
                                             [Log Processing]
                                                     ↓
                                          [Elasticsearch / Loki]
                                                     ↓
                                             [Kibana / Grafana]
3. Distributed Traces
A trace follows a single request as it flows through multiple services. Each unit of work along the way (a service call, a database query) is recorded as a span.
Trace ID: abc-123
Client → [API Gateway] Span 1: 250ms
↓
[User Service] Span 2: 50ms
↓
[Order Service] Span 3: 150ms
↓
[Payment Service] Span 4: 100ms ← Bottleneck identified!
↓
[Database] Span 5: 80ms
Trace visualization (waterfall):
├── API Gateway ──────────────────────────────── 250ms
│ ├── User Service ────── 50ms
│ ├── Order Service ─────────────── 150ms
│ │ ├── Payment Service ──────── 100ms
│ │ │ └── Payment DB ───── 80ms
│ │ └── Inventory Check ── 30ms
│ └── Notification Service ── 20ms (async)
Key concepts:
| Concept | Description |
|---------|-------------|
| Trace | End-to-end journey of a request |
| Span | A single unit of work within a trace |
| Span Context | Propagated between services (trace ID, span ID, flags) |
| Baggage | Custom key-value data propagated across services |
| Sampling | Collect only a fraction of traces to manage volume |
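Span context propagation can be sketched in pure Python with contextvars: each new span inherits the trace ID and parent span ID from the currently active span. This is a hand-rolled illustration of the concepts in the table, not a real tracing SDK:

```python
import contextvars
import time
import uuid

current_span = contextvars.ContextVar("current_span", default=None)
finished = []  # stands in for an exporter that ships spans to a backend

class Span:
    """One unit of work; children inherit the trace ID of their parent."""
    def __init__(self, name):
        parent = current_span.get()
        self.name = name
        self.trace_id = parent.trace_id if parent else uuid.uuid4().hex
        self.span_id = uuid.uuid4().hex[:8]
        self.parent_id = parent.span_id if parent else None
        self.start = time.monotonic()

    def __enter__(self):
        self._token = current_span.set(self)  # become the active span
        return self

    def __exit__(self, *exc):
        self.duration_ms = (time.monotonic() - self.start) * 1000
        current_span.reset(self._token)       # restore the parent
        finished.append(self)

with Span("api-gateway"):
    with Span("payment-service") as child:
        pass  # child shares the gateway span's trace_id
```

Across service boundaries, real tracers serialize this context into request headers (e.g. the W3C traceparent header) instead of a process-local variable.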
Observability Stack
| Component | Open Source | Commercial |
|---|---|---|
| Metrics | Prometheus + Grafana | Datadog, New Relic, Dynatrace |
| Logs | ELK (Elasticsearch, Logstash, Kibana), Loki + Grafana | Splunk, Sumo Logic |
| Traces | Jaeger, Zipkin | Datadog APM, Honeycomb |
| All-in-one | Grafana Stack (Prometheus + Loki + Tempo) | Datadog, New Relic |
| Instrumentation | OpenTelemetry (OTEL) | Vendor SDKs |
OpenTelemetry (OTEL)
The industry standard for instrumentation — a single API for metrics, logs, and traces.
Application → [OTEL SDK] → [OTEL Collector] → Prometheus (metrics)
                                            → Loki (logs)
                                            → Jaeger (traces)
Alerting
Alert Design Principles
- Alert on symptoms, not causes: Alert on "error rate > 5%" not "CPU > 80%".
- Every alert should be actionable: If you can't do anything about it, don't alert.
- Use severity levels: Critical (page), Warning (ticket), Info (dashboard).
- Avoid alert fatigue: Too many alerts = alerts get ignored.
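One common way to reduce flapping and alert fatigue is to require a symptom to persist before firing (Prometheus calls this a "for" duration). A minimal sketch of such a rule evaluator, with hypothetical thresholds:

```python
class AlertRule:
    """Fire only when the condition has held for `for_seconds` consecutively."""
    def __init__(self, condition, for_seconds):
        self.condition = condition
        self.for_seconds = for_seconds
        self.pending_since = None  # when the condition first became true

    def evaluate(self, value, now):
        if not self.condition(value):
            self.pending_since = None  # condition cleared: reset
            return False
        if self.pending_since is None:
            self.pending_since = now   # condition just became true: pending
        return now - self.pending_since >= self.for_seconds

# Symptom-based: alert on error rate above 5% for 5 minutes, not on CPU
rule = AlertRule(condition=lambda err_rate: err_rate > 0.05, for_seconds=300)
rule.evaluate(0.08, now=0)    # pending, not firing yet
rule.evaluate(0.09, now=300)  # held for 5 minutes -> fires
```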
Alerting Pipeline
Metrics → [Alert Rules] → Triggered? → [Notification Channel]
                                              │
                                              ├── PagerDuty (critical)
                                              ├── Slack (warning)
                                              ├── Email (info)
                                              └── Dashboard (all)
SLI, SLO, SLA
| Concept | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | A measurable metric | p99 latency, error rate, availability |
| SLO (Service Level Objective) | Target for an SLI | p99 latency < 200ms, availability > 99.9% |
| SLA (Service Level Agreement) | Contractual commitment | If availability < 99.9%, customer gets credit |
Error Budget:
If SLO is 99.9% availability:
- Error budget = 0.1% = ~8.76 hours of downtime per year (roughly 43 minutes per month).
- If error budget is exhausted → freeze deployments, focus on reliability.
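The error-budget arithmetic is just the SLO gap multiplied by the period:

```python
def error_budget_hours(slo, period_hours=24 * 365):
    """Allowed downtime (in hours) for an availability SLO over a period."""
    return (1 - slo) * period_hours

yearly = error_budget_hours(0.999)            # ~8.76 hours/year
monthly = error_budget_hours(0.999, 24 * 30)  # ~0.72 hours (~43 min) per 30 days
```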
Health Checks
Liveness Check
"Is the service running?"
GET /health/live → 200 OK (process is alive)
Readiness Check
"Is the service ready to handle traffic?"
GET /health/ready → 200 OK (connected to DB, cache warmed, etc.)
→ 503 Service Unavailable (not ready)
Deep Health Check
"Are all dependencies healthy?"
GET /health/deep
{
"status": "healthy",
"checks": {
"database": {"status": "up", "latency_ms": 5},
"redis": {"status": "up", "latency_ms": 1},
"external-api": {"status": "degraded", "latency_ms": 500}
}
}
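A deep health check is essentially an aggregation over dependency probes, where the overall status is the worst individual result. A sketch with hypothetical probe functions (a real service would run these with timeouts and serve the result over HTTP):

```python
def check_database():  # hypothetical dependency probes
    return {"status": "up", "latency_ms": 5}

def check_redis():
    return {"status": "up", "latency_ms": 1}

def deep_health(checks):
    """Run each probe; overall status is the worst dependency status."""
    results = {name: probe() for name, probe in checks.items()}
    statuses = [r["status"] for r in results.values()]
    if "down" in statuses:
        overall = "unhealthy"
    elif "degraded" in statuses:
        overall = "degraded"
    else:
        overall = "healthy"
    return {"status": overall, "checks": results}

report = deep_health({"database": check_database, "redis": check_redis})
# report["status"] == "healthy"
```

Readiness checks typically return 503 when this overall status is not healthy, so the load balancer stops routing traffic to the instance.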
Incident Response
Alert Fired → Acknowledge → Triage → Diagnose → Mitigate → Resolve → Post-mortem
On-Call Best Practices:
- Runbooks for common alerts.
- Escalation paths if not acknowledged within N minutes.
- Blameless post-mortems (focus on systems, not people).
- Track MTTD (Mean Time to Detect), MTTR (Mean Time to Recover).
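MTTD and MTTR fall out of incident timestamps directly. A small sketch with hypothetical incident records (timestamps in seconds since the incident window opened):

```python
from statistics import mean

def mttd_mttr(incidents):
    """MTTD = detection delay from start; MTTR = recovery delay from start."""
    mttd = mean(i["detected"] - i["started"] for i in incidents)
    mttr = mean(i["recovered"] - i["started"] for i in incidents)
    return mttd, mttr

incidents = [
    {"started": 0, "detected": 120, "recovered": 900},
    {"started": 0, "detected": 60,  "recovered": 300},
]
mttd, mttr = mttd_mttr(incidents)  # 90s to detect, 600s to recover
```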
Summary
| Concept | Key Point |
|---|---|
| Metrics | Numerical data over time — dashboards and alerting |
| Logs | Detailed event records — debugging and auditing |
| Traces | Request flows across services — latency analysis |
| RED method | Rate, Errors, Duration — for services |
| USE method | Utilization, Saturation, Errors — for infrastructure |
| OpenTelemetry | Industry-standard instrumentation |
| SLI/SLO/SLA | Measure → target → contractual commitment |
| Alert on symptoms | Error rate, latency — not CPU percentage |
Rule of thumb: Instrument all services with metrics, structured logs, and distributed traces from day one. Use OpenTelemetry for vendor-neutral instrumentation. Alert on SLO violations, not individual metrics.