17. Monitoring, Logging & Observability
Observability is the ability to understand the internal state of a system by examining its external outputs. In distributed systems, observability is essential for debugging, performance optimization, and ensuring reliability.
The Three Pillars of Observability
                Observability
               /      |      \
          Metrics    Logs    Traces
          (What)    (Why)    (Where)
1. Metrics
Numerical measurements collected over time. Used for dashboards, alerting, and trend analysis.
cpu_usage{host="server-1"} 0.75 # 75% CPU
http_requests_total{status="200"} 15234 # Request counter
request_duration_seconds{quantile="0.99"} 0.250 # 250ms p99 latency
Types of Metrics:
| Type | Description | Example |
|---|---|---|
| Counter | Monotonically increasing value | Total requests, total errors |
| Gauge | Value that goes up and down | CPU usage, memory usage, queue depth |
| Histogram | Distribution of values in buckets | Request latency distribution |
| Summary | Calculated percentiles | p50, p95, p99 latency |
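The four metric types can be sketched as minimal Python classes. This is an illustrative toy, not a real metrics client; class and method names mirror Prometheus conventions but are assumptions for this sketch:

```python
import bisect

class Counter:
    """Monotonically increasing value -- resets only on process restart."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount

class Gauge:
    """Value that can go up and down, e.g. current queue depth."""
    def __init__(self):
        self.value = 0
    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations into latency buckets (upper bounds in seconds)."""
    def __init__(self, buckets=(0.05, 0.1, 0.25, 0.5, 1.0)):
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # last slot = +Inf
        self.total = 0.0
    def observe(self, value):
        # bisect_left places a value equal to a bound into that bucket
        self.counts[bisect.bisect_left(self.buckets, value)] += 1
        self.total += value

requests = Counter()
requests.inc()
latency = Histogram()
latency.observe(0.23)  # lands in the 0.25s bucket
```

A Summary differs from a Histogram in that percentiles are precomputed client-side rather than derived from buckets at query time.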
Key Metrics to Monitor:
| Category | Metrics |
|---|---|
| Service | Request rate, error rate, latency (p50/p95/p99) |
| Infrastructure | CPU, memory, disk I/O, network I/O |
| Database | Query latency, connections, replication lag, cache hit rate |
| Queue | Queue depth, consumer lag, processing rate |
| Business | Orders/sec, revenue, active users, conversion rate |
RED Method (Request-focused)
For request-driven services (APIs, web servers):
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Latency distribution (p50, p95, p99)
USE Method (Resource-focused)
For infrastructure resources (CPU, memory, disk, network):
- Utilization: Percentage of resource in use
- Saturation: Queue length / overflow
- Errors: Error count
Four Golden Signals (Google SRE)
- Latency: Time to serve a request
- Traffic: Requests per second
- Errors: Rate of failed requests
- Saturation: How full the system is (CPU, memory, queue depth)
2. Logs
Timestamped, immutable text records of discrete events.
{
"timestamp": "2025-01-15T10:30:00.123Z",
"level": "ERROR",
"service": "payment-service",
"traceId": "abc123def456",
"spanId": "span-789",
"message": "Payment processing failed",
"userId": "user-456",
"orderId": "order-789",
"error": "CardDeclined",
"duration_ms": 1250
}
Log Levels:
| Level | Use |
|---|---|
| TRACE | Very detailed debugging (usually disabled in production) |
| DEBUG | Detailed information for debugging |
| INFO | Normal operation events |
| WARN | Something unexpected but not immediately harmful |
| ERROR | Something failed; requires attention |
| FATAL | System cannot continue |
Structured Logging Best Practices:
- Use JSON format for machine parsability.
- Include traceId for correlation with traces.
- Include requestId for request tracking.
- Include contextual fields (userId, orderId, etc.).
- Don't log sensitive data (passwords, tokens, PII).
- Use consistent field names across all services.
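These practices can be sketched with the standard-library logging module and a custom JSON formatter. The service name and context fields are hypothetical; real services typically use a dedicated structured-logging library:

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with consistent field names."""
    def format(self, record):
        entry = {
            "timestamp": time.strftime(
                "%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "payment-service",  # hypothetical service name
            "message": record.getMessage(),
        }
        # Contextual fields (traceId, orderId, ...) attached via `extra=`
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("Payment processing failed",
             extra={"context": {"traceId": "abc123", "error": "CardDeclined"}})
```

Passing context via extra rather than interpolating it into the message string keeps fields machine-parsable.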
Log Aggregation Pipeline:
Services → [Log Shipper (Fluentd/Filebeat)] → [Kafka / Buffer]
                                                     ↓
                                             [Log Processing]
                                                     ↓
                                          [Elasticsearch / Loki]
                                                     ↓
                                             [Kibana / Grafana]
3. Distributed Traces
A trace follows a single request as it flows through multiple services. Each unit of work along the way (a service call, a database query) is recorded as a span.
Trace ID: abc-123
Client → [API Gateway] Span 1: 250ms
↓
[User Service] Span 2: 50ms
↓
[Order Service] Span 3: 150ms
↓
[Payment Service] Span 4: 100ms ← Bottleneck identified!
↓
[Database] Span 5: 80ms
Trace visualization (waterfall):
├── API Gateway ──────────────────────────────── 250ms
│ ├── User Service ────── 50ms
│ ├── Order Service ─────────────── 150ms
│ │ ├── Payment Service ──────── 100ms
│ │ │ └── Payment DB ───── 80ms
│ │ └── Inventory Check ── 30ms
│ └── Notification Service ── 20ms (async)
Key concepts:
| Concept | Description |
|---------|-------------|
| Trace | End-to-end journey of a request |
| Span | A single unit of work within a trace |
| Span Context | Propagated between services (trace ID, span ID, flags) |
| Baggage | Custom key-value data propagated across services |
| Sampling | Collect only a fraction of traces to manage volume |
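Span context propagation can be sketched in pure Python with contextvars: each new span inherits the trace ID and parent span ID from the currently active span. This is a hand-rolled illustration of the concepts in the table, not a real tracing SDK:

```python
import contextvars
import time
import uuid

current_span = contextvars.ContextVar("current_span", default=None)
finished = []  # stands in for an exporter that ships spans to a backend

class Span:
    """One unit of work; children inherit the trace ID of their parent."""
    def __init__(self, name):
        parent = current_span.get()
        self.name = name
        self.trace_id = parent.trace_id if parent else uuid.uuid4().hex
        self.span_id = uuid.uuid4().hex[:8]
        self.parent_id = parent.span_id if parent else None
        self.start = time.monotonic()

    def __enter__(self):
        self._token = current_span.set(self)  # become the active span
        return self

    def __exit__(self, *exc):
        self.duration_ms = (time.monotonic() - self.start) * 1000
        current_span.reset(self._token)       # restore the parent
        finished.append(self)

with Span("api-gateway"):
    with Span("payment-service") as child:
        pass  # child shares the gateway span's trace_id
```

Across service boundaries, real tracers serialize this context into request headers (e.g. the W3C traceparent header) instead of a process-local variable.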
Observability Stack
| Component | Open Source | Commercial |
|---|---|---|
| Metrics | Prometheus + Grafana | Datadog, New Relic, Dynatrace |
| Logs | ELK (Elasticsearch, Logstash, Kibana), Loki + Grafana | Splunk, Sumo Logic |
| Traces | Jaeger, Zipkin | Datadog APM, Honeycomb |
| All-in-one | Grafana Stack (Prometheus + Loki + Tempo) | Datadog, New Relic |
| Instrumentation | OpenTelemetry (OTEL) | Vendor SDKs |
OpenTelemetry (OTEL)
The industry standard for instrumentation — a single API for metrics, logs, and traces.
Application → [OTEL SDK] → [OTEL Collector] → Prometheus (metrics)
                                            → Loki (logs)
                                            → Jaeger (traces)
Alerting
Alert Design Principles
- Alert on symptoms, not causes: Alert on "error rate > 5%" not "CPU > 80%".
- Every alert should be actionable: If you can't do anything about it, don't alert.
- Use severity levels: Critical (page), Warning (ticket), Info (dashboard).
- Avoid alert fatigue: Too many alerts = alerts get ignored.
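One common way to reduce flapping and alert fatigue is to require a symptom to persist before firing (Prometheus calls this a "for" duration). A minimal sketch of such a rule evaluator, with hypothetical thresholds:

```python
class AlertRule:
    """Fire only when the condition has held for `for_seconds` consecutively."""
    def __init__(self, condition, for_seconds):
        self.condition = condition
        self.for_seconds = for_seconds
        self.pending_since = None  # when the condition first became true

    def evaluate(self, value, now):
        if not self.condition(value):
            self.pending_since = None  # condition cleared: reset
            return False
        if self.pending_since is None:
            self.pending_since = now   # condition just became true: pending
        return now - self.pending_since >= self.for_seconds

# Symptom-based: alert on error rate above 5% for 5 minutes, not on CPU
rule = AlertRule(condition=lambda err_rate: err_rate > 0.05, for_seconds=300)
rule.evaluate(0.08, now=0)    # pending, not firing yet
rule.evaluate(0.09, now=300)  # held for 5 minutes -> fires
```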
Alerting Pipeline
Metrics → [Alert Rules] → Triggered? → [Notification Channel]
                                              │
                                              ├── PagerDuty (critical)
                                              ├── Slack (warning)
                                              ├── Email (info)
                                              └── Dashboard (all)
SLI, SLO, SLA
| Concept | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | A measurable metric | p99 latency, error rate, availability |
| SLO (Service Level Objective) | Target for an SLI | p99 latency < 200ms, availability > 99.9% |
| SLA (Service Level Agreement) | Contractual commitment | If availability < 99.9%, customer gets credit |
Error Budget:
If SLO is 99.9% availability:
- Error budget = 0.1% = ~8.76 hours of downtime per year (roughly 43 minutes per month).
- If error budget is exhausted → freeze deployments, focus on reliability.
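The error-budget arithmetic is just the SLO gap multiplied by the period:

```python
def error_budget_hours(slo, period_hours=24 * 365):
    """Allowed downtime (in hours) for an availability SLO over a period."""
    return (1 - slo) * period_hours

yearly = error_budget_hours(0.999)            # ~8.76 hours/year
monthly = error_budget_hours(0.999, 24 * 30)  # ~0.72 hours (~43 min) per 30 days
```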
Health Checks
Liveness Check
"Is the service running?"
GET /health/live → 200 OK (process is alive)
Readiness Check
"Is the service ready to handle traffic?"
GET /health/ready → 200 OK (connected to DB, cache warmed, etc.)
→ 503 Service Unavailable (not ready)
Deep Health Check
"Are all dependencies healthy?"
GET /health/deep
{
"status": "healthy",
"checks": {
"database": {"status": "up", "latency_ms": 5},
"redis": {"status": "up", "latency_ms": 1},
"external-api": {"status": "degraded", "latency_ms": 500}
}
}
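A deep health check is essentially an aggregation over dependency probes, where the overall status is the worst individual result. A sketch with hypothetical probe functions (a real service would run these with timeouts and serve the result over HTTP):

```python
def check_database():  # hypothetical dependency probes
    return {"status": "up", "latency_ms": 5}

def check_redis():
    return {"status": "up", "latency_ms": 1}

def deep_health(checks):
    """Run each probe; overall status is the worst dependency status."""
    results = {name: probe() for name, probe in checks.items()}
    statuses = [r["status"] for r in results.values()]
    if "down" in statuses:
        overall = "unhealthy"
    elif "degraded" in statuses:
        overall = "degraded"
    else:
        overall = "healthy"
    return {"status": overall, "checks": results}

report = deep_health({"database": check_database, "redis": check_redis})
# report["status"] == "healthy"
```

Readiness checks typically return 503 when this overall status is not healthy, so the load balancer stops routing traffic to the instance.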
Incident Response
Alert Fired → Acknowledge → Triage → Diagnose → Mitigate → Resolve → Post-mortem
On-Call Best Practices:
- Runbooks for common alerts.
- Escalation paths if not acknowledged within N minutes.
- Blameless post-mortems (focus on systems, not people).
- Track MTTD (Mean Time to Detect), MTTR (Mean Time to Recover).
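MTTD and MTTR fall out of incident timestamps directly. A small sketch with hypothetical incident records (timestamps in seconds since the incident window opened):

```python
from statistics import mean

def mttd_mttr(incidents):
    """MTTD = detection delay from start; MTTR = recovery delay from start."""
    mttd = mean(i["detected"] - i["started"] for i in incidents)
    mttr = mean(i["recovered"] - i["started"] for i in incidents)
    return mttd, mttr

incidents = [
    {"started": 0, "detected": 120, "recovered": 900},
    {"started": 0, "detected": 60,  "recovered": 300},
]
mttd, mttr = mttd_mttr(incidents)  # 90s to detect, 600s to recover
```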
Summary
| Concept | Key Point |
|---|---|
| Metrics | Numerical data over time — dashboards and alerting |
| Logs | Detailed event records — debugging and auditing |
| Traces | Request flows across services — latency analysis |
| RED method | Rate, Errors, Duration — for services |
| USE method | Utilization, Saturation, Errors — for infrastructure |
| OpenTelemetry | Industry-standard instrumentation |
| SLI/SLO/SLA | Measure → target → contractual commitment |
| Alert on symptoms | Error rate, latency — not CPU percentage |
Rule of thumb: Instrument all services with metrics, structured logs, and distributed traces from day one. Use OpenTelemetry for vendor-neutral instrumentation. Alert on SLO violations, not individual metrics.