17. Monitoring, Logging & Observability

Observability is the ability to understand the internal state of a system by examining its external outputs. In distributed systems, observability is essential for debugging, performance optimization, and ensuring reliability.


The Three Pillars of Observability

                    Observability
                   /      |       \
                  /       |        \
           Metrics     Logs      Traces
           (What)    (Why)     (Where)

1. Metrics

Numerical measurements collected over time. Used for dashboards, alerting, and trend analysis.

cpu_usage{host="server-1"} 0.75         # 75% CPU
http_requests_total{status="200"} 15234  # Request counter
request_duration_seconds{quantile="0.99"} 0.250  # 250ms p99 latency

Types of Metrics:

| Type | Description | Example |
|------|-------------|---------|
| Counter | Monotonically increasing value | Total requests, total errors |
| Gauge | Value that goes up and down | CPU usage, memory usage, queue depth |
| Histogram | Distribution of values in buckets | Request latency distribution |
| Summary | Calculated percentiles | p50, p95, p99 latency |
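The semantics of the first three metric types can be sketched with minimal pure-Python classes (a toy illustration, not a real client library; bucket boundaries are arbitrary):

```python
from bisect import bisect_left

class Counter:
    """Monotonically increasing value: only inc() is allowed."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount

class Gauge:
    """Value that can go up and down, e.g. CPU usage or queue depth."""
    def __init__(self):
        self.value = 0
    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations into latency buckets."""
    def __init__(self, buckets=(0.05, 0.1, 0.25, 0.5, 1.0)):
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # last slot = +Inf
    def observe(self, value):
        self.counts[bisect_left(self.buckets, value)] += 1

requests = Counter()
requests.inc()
latency = Histogram()
latency.observe(0.2)  # falls into the 0.25 bucket
latency.observe(2.0)  # falls into the +Inf overflow slot
```

A real client (e.g. the Prometheus client libraries) adds labels, thread safety, and an exposition format on top of these same semantics.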

Key Metrics to Monitor:

| Category | Metrics |
|----------|---------|
| Service | Request rate, error rate, latency (p50/p95/p99) |
| Infrastructure | CPU, memory, disk I/O, network I/O |
| Database | Query latency, connections, replication lag, cache hit rate |
| Queue | Queue depth, consumer lag, processing rate |
| Business | Orders/sec, revenue, active users, conversion rate |

RED Method (Request-focused)

For request-driven services (APIs, web servers):

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Latency distribution (p50, p95, p99)
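Computing the three RED numbers over a window of request records can be sketched as follows (the record shape and field names are illustrative assumptions):

```python
def red_metrics(requests, window_seconds):
    """requests: list of dicts with 'status' and 'duration_s' keys
    (hypothetical shape). Returns rate, error rate, and latency percentiles."""
    durations = sorted(r["duration_s"] for r in requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    def pct(p):
        # nearest-rank percentile over the sorted durations
        return durations[min(len(durations) - 1, int(p / 100 * len(durations)))]
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_rps": errors / window_seconds,
        "p50_s": pct(50), "p95_s": pct(95), "p99_s": pct(99),
    }

# 10 requests over a 10-second window, one of them a 500
records = [{"status": 200, "duration_s": d / 10} for d in range(1, 11)]
records[0]["status"] = 500
metrics = red_metrics(records, window_seconds=10)
```

In practice these are derived from counters and histograms in the metrics backend rather than computed in the application, but the definitions are the same.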

USE Method (Resource-focused)

For infrastructure resources (CPU, memory, disk, network):

  • Utilization: Percentage of resource in use
  • Saturation: Queue length / overflow
  • Errors: Error count

Four Golden Signals (Google SRE)

  1. Latency: Time to serve a request
  2. Traffic: Requests per second
  3. Errors: Rate of failed requests
  4. Saturation: How full the system is (CPU, memory, queue depth)

2. Logs

Timestamped, immutable text records of discrete events.

{
  "timestamp": "2025-01-15T10:30:00.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "traceId": "abc123def456",
  "spanId": "span-789",
  "message": "Payment processing failed",
  "userId": "user-456",
  "orderId": "order-789",
  "error": "CardDeclined",
  "duration_ms": 1250
}

Log Levels:

| Level | Use |
|-------|-----|
| TRACE | Very detailed debugging (usually disabled in production) |
| DEBUG | Detailed information for debugging |
| INFO | Normal operation events |
| WARN | Something unexpected but not immediately harmful |
| ERROR | Something failed; requires attention |
| FATAL | System cannot continue |

Structured Logging Best Practices:

  • Use JSON format for machine parsability.
  • Include traceId for correlation with traces.
  • Include requestId for request tracking.
  • Include contextual fields (userId, orderId, etc.).
  • Don't log sensitive data (passwords, tokens, PII).
  • Use consistent field names across all services.
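A structured JSON formatter for Python's stdlib `logging` module can be sketched like this (the `context` attribute and the hard-coded service name are assumptions; real setups inject these from configuration or middleware):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": "payment-service",  # usually injected from config
            "message": record.getMessage(),
        }
        # merge contextual fields passed via logger(..., extra={"context": ...})
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment processing failed",
             extra={"context": {"traceId": "abc123", "orderId": "order-789"}})
```

Keeping the formatter in one shared library is an easy way to enforce consistent field names across services.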

Log Aggregation Pipeline:

Services → [Log Shipper (Fluentd/Filebeat)] → [Kafka / Buffer]
                                                      ↓
                                              [Log Processing]
                                                      ↓
                                              [Elasticsearch / Loki]
                                                      ↓
                                              [Kibana / Grafana]

3. Distributed Traces

A trace follows a single request as it flows through multiple services. Each unit of work within it (an RPC call, a database query) is recorded as a span.

Trace ID: abc-123

Client → [API Gateway]  Span 1: 250ms
              ↓
         [User Service]  Span 2: 50ms
              ↓
         [Order Service] Span 3: 150ms
              ↓
         [Payment Service] Span 4: 100ms ← Bottleneck identified!
              ↓
         [Database]       Span 5: 80ms

Trace visualization (waterfall):

├── API Gateway ──────────────────────────────── 250ms
│   ├── User Service ────── 50ms
│   ├── Order Service ─────────────── 150ms
│   │   ├── Payment Service ──────── 100ms
│   │   │   └── Payment DB ───── 80ms
│   │   └── Inventory Check ── 30ms
│   └── Notification Service ── 20ms (async)

Key concepts:
| Concept | Description |
|---------|-------------|
| Trace | End-to-end journey of a request |
| Span | A single unit of work within a trace |
| Span Context | Propagated between services (trace ID, span ID, flags) |
| Baggage | Custom key-value data propagated across services |
| Sampling | Collect only a fraction of traces to manage volume |
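The trace/span/parent relationships above can be illustrated with a toy tracer (a sketch of the concepts, not the OpenTelemetry API; real tracers also propagate span context across process boundaries via headers):

```python
import time
import uuid
from contextlib import contextmanager

class Tracer:
    """Toy tracer: records spans with trace/parent IDs and durations."""
    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []
        self._stack = []  # ancestry of currently open spans

    @contextmanager
    def span(self, name):
        span = {"traceId": self.trace_id, "spanId": uuid.uuid4().hex,
                "parentId": self._stack[-1]["spanId"] if self._stack else None,
                "name": name}
        self._stack.append(span)
        start = time.perf_counter()
        try:
            yield span
        finally:
            span["duration_ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.spans.append(span)

tracer = Tracer()
with tracer.span("api-gateway"):
    with tracer.span("order-service"):
        with tracer.span("payment-service"):
            pass  # call the payment backend here
```

Every span carries the same trace ID, and each nested span records its parent's span ID, which is exactly the structure the waterfall visualization is rebuilt from.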


Observability Stack

| Component | Open Source | Commercial |
|-----------|-------------|------------|
| Metrics | Prometheus + Grafana | Datadog, New Relic, Dynatrace |
| Logs | ELK (Elasticsearch, Logstash, Kibana), Loki + Grafana | Splunk, Sumo Logic |
| Traces | Jaeger, Zipkin | Datadog APM, Honeycomb |
| All-in-one | Grafana Stack (Prometheus + Loki + Tempo) | Datadog, New Relic |
| Instrumentation | OpenTelemetry (OTEL) | Vendor SDKs |

OpenTelemetry (OTEL)

The industry standard for instrumentation — a single API for metrics, logs, and traces.

Application → [OTEL SDK] → [OTEL Collector] → Prometheus (metrics)
                                             → Loki (logs)
                                             → Jaeger (traces)

Alerting

Alert Design Principles

  1. Alert on symptoms, not causes: Alert on "error rate > 5%" not "CPU > 80%".
  2. Every alert should be actionable: If you can't do anything about it, don't alert.
  3. Use severity levels: Critical (page), Warning (ticket), Info (dashboard).
  4. Avoid alert fatigue: Too many alerts = alerts get ignored.
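Symptom-based alert rules with severity routing can be sketched as a simple threshold evaluation (the rule names, thresholds, and channels are illustrative):

```python
# Each rule maps a symptom metric to a threshold, severity, and channel.
RULES = [
    {"name": "HighErrorRate", "metric": "error_rate", "threshold": 0.05,
     "severity": "critical", "channel": "pagerduty"},
    {"name": "ElevatedLatency", "metric": "p99_latency_s", "threshold": 0.2,
     "severity": "warning", "channel": "slack"},
]

def evaluate(rules, snapshot):
    """Return the alerts whose metric exceeds its threshold."""
    return [
        {"name": r["name"], "severity": r["severity"], "channel": r["channel"],
         "value": snapshot[r["metric"]]}
        for r in rules
        if snapshot.get(r["metric"], 0) > r["threshold"]
    ]

# 8% error rate (> 5%) fires; 150ms p99 (<= 200ms) does not
fired = evaluate(RULES, {"error_rate": 0.08, "p99_latency_s": 0.15})
```

Real alert managers add evaluation windows, deduplication, and silencing on top of this basic "rule over metric snapshot" loop.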

Alerting Pipeline

Metrics → [Alert Rules] → Triggered? → [Notification Channel]
                                              │
                                    ├── PagerDuty (critical)
                                    ├── Slack (warning)
                                    ├── Email (info)
                                    └── Dashboard (all)

SLI, SLO, SLA

| Concept | Definition | Example |
|---------|------------|---------|
| SLI (Service Level Indicator) | A measurable metric | p99 latency, error rate, availability |
| SLO (Service Level Objective) | Target for an SLI | p99 latency < 200ms, availability > 99.9% |
| SLA (Service Level Agreement) | Contractual commitment | If availability < 99.9%, customer gets credit |

Error Budget:

If SLO is 99.9% availability:

  • Error budget = 0.1% = ~8.76 hours/year
  • If error budget is exhausted → freeze deployments, focus on reliability.
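The error-budget arithmetic is straightforward (a 365-day year is assumed here):

```python
def error_budget_hours(slo, period_hours=365 * 24):
    """Allowed downtime per period for a given availability SLO."""
    return (1 - slo) * period_hours

# 99.9% over a 365-day year -> 0.1% of 8760 h = 8.76 h
budget = error_budget_hours(0.999)
```

Tightening the SLO by one "nine" (99.99%) shrinks the budget tenfold, to roughly 53 minutes per year, which is why each extra nine is dramatically more expensive.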

Health Checks

Liveness Check

"Is the service running?"

GET /health/live → 200 OK (process is alive)

Readiness Check

"Is the service ready to handle traffic?"

GET /health/ready → 200 OK (connected to DB, cache warmed, etc.)
                  → 503 Service Unavailable (not ready)

Deep Health Check

"Are all dependencies healthy?"

GET /health/deep
{
  "status": "healthy",
  "checks": {
    "database": {"status": "up", "latency_ms": 5},
    "redis": {"status": "up", "latency_ms": 1},
    "external-api": {"status": "degraded", "latency_ms": 500}
  }
}
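The aggregation behind a deep health endpoint can be sketched as follows (the check names mirror the response above; the dependency probes are stand-ins for real connectivity checks):

```python
def deep_health(checks):
    """checks: {name: callable returning (status, latency_ms)}.
    Overall status is 'healthy' only if every dependency is 'up';
    any 'down' makes it 'unhealthy', otherwise it is 'degraded'."""
    results, statuses = {}, set()
    for name, probe in checks.items():
        status, latency_ms = probe()
        results[name] = {"status": status, "latency_ms": latency_ms}
        statuses.add(status)
    if "down" in statuses:
        overall = "unhealthy"
    elif "degraded" in statuses:
        overall = "degraded"
    else:
        overall = "healthy"
    return {"status": overall, "checks": results}

report = deep_health({
    "database": lambda: ("up", 5),
    "redis": lambda: ("up", 1),
    "external-api": lambda: ("degraded", 500),  # slow but responding
})
```

The HTTP handler then maps this result to a status code, e.g. 200 for healthy/degraded and 503 when a hard dependency is down.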

Incident Response

Alert Fired → Acknowledge → Triage → Diagnose → Mitigate → Resolve → Post-mortem

On-Call Best Practices:

  • Runbooks for common alerts.
  • Escalation paths if not acknowledged within N minutes.
  • Blameless post-mortems (focus on systems, not people).
  • Track MTTD (Mean Time to Detect), MTTR (Mean Time to Recover).
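MTTD and MTTR are just averages over incident timestamps (the incident records below are illustrative):

```python
from datetime import datetime

def mean_minutes(incidents, start_key, end_key):
    """Average gap in minutes between two timestamps across incidents."""
    gaps = [(i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents]
    return sum(gaps) / len(gaps)

incidents = [  # hypothetical incident log
    {"started": datetime(2025, 1, 15, 10, 0),
     "detected": datetime(2025, 1, 15, 10, 5),
     "resolved": datetime(2025, 1, 15, 10, 45)},
    {"started": datetime(2025, 1, 16, 9, 0),
     "detected": datetime(2025, 1, 16, 9, 15),
     "resolved": datetime(2025, 1, 16, 9, 35)},
]

mttd = mean_minutes(incidents, "started", "detected")  # detection lag
mttr = mean_minutes(incidents, "started", "resolved")  # recovery time
```

Good observability lowers MTTD (alerts fire sooner), while runbooks and practiced escalation lower MTTR.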

Summary

| Concept | Key Point |
|---------|-----------|
| Metrics | Numerical data over time — dashboards and alerting |
| Logs | Detailed event records — debugging and auditing |
| Traces | Request flows across services — latency analysis |
| RED method | Rate, Errors, Duration — for services |
| USE method | Utilization, Saturation, Errors — for infrastructure |
| OpenTelemetry | Industry-standard instrumentation |
| SLI/SLO/SLA | Measure → target → contractual commitment |
| Alert on symptoms | Error rate, latency — not CPU percentage |

Rule of thumb: Instrument all services with metrics, structured logs, and distributed traces from day one. Use OpenTelemetry for vendor-neutral instrumentation. Alert on SLO violations, not individual metrics.