AWS (Day 8)

Observability on Kubernetes: logs, metrics, traces, and alerts with the open-source stack

Me, observing

Disclaimers:

  1. Opinions expressed in this post (and in all my posts) are, unless otherwise specified, solely my own. They do not reflect the views, policies, or positions of any organization, employer, or affiliated group.

  2. This article is educational content. The examples are intentionally simplified for clarity. All tools featured here are free and open-source software — they run identically on EKS, GKE, bare metal, or your laptop.

  3. I've strived for accuracy throughout this piece. If you catch any errors, please reach out — I'd be grateful for the feedback and happy to make updates!


Hook

The genomics API is running. Terraform provisioned the cluster, Helm deployed the application. From the outside, everything looks fine.

Then, a researcher reports that their variant analysis job submitted two hours ago still hasn't returned results. You check Kubernetes: the pods are running. You check the load balancer: it's healthy. The deployment logs show no errors. But something is wrong, and you have absolutely no idea where to start looking.

That's the moment you realise your system is blind. You built the infrastructure. You secured it. You automated its deployment. And then you flew it into production with no instruments.

This article is about fixing that. Not by adding monitoring as an afterthought, but by building the kind of visibility that lets you answer three questions before your users even notice a problem: what is happening, why is it happening, and where exactly is it happening? No worries, I'm not there yet. Lord knows how much 4s misses observability & monitoring :\



ToC

  1. Observability vs monitoring
  2. The four pillars
  3. The tooling landscape
  4. OpenTelemetry: the collection standard
  5. SLI, SLO, SLA: from data to accountability
  6. Conclusion
  7. More on this topic



Observability vs monitoring

These two words are often used interchangeably. They shouldn't be.

Monitoring is about watching known failure modes. You define thresholds — CPU above 80%, error rate above 1% — and you alert when they're crossed. It answers the question: "is this thing I already know about happening?" Monitoring is reactive: you first have to know what can go wrong before you can monitor for it.

Observability is about understanding unknown failure modes from the outside. A system is observable if you can ask arbitrary questions about its internal state — without having predicted those questions in advance. It answers: "why is this behaving the way it is?"

In practice, you need both. Monitoring tells you "something is broken". Observability lets you answer "why is it broken, exactly, for whom, since when, and in what context", even for failures you never anticipated.

In a K8s cluster, failures are often not the ones you predicted. A genomic analysis pipeline stalls not because of high CPU, but because a database connection pool is exhausted by a single slow query in one namespace — and your monitoring threshold was on the wrong metric entirely. Observability gives you the tools to discover that. Monitoring alone doesn't.

[Image: observability]

[Image: monitoring]



The four pillars

Cloud-native observability is built on four pillars. Together, they give you a complete picture of a running system.

Logs — what happened

Logs are timestamped records of discrete events: a request came in, a query failed, a job completed. They are the most familiar signal and the easiest to produce — every Django app already writes logs.

On Kubernetes, logs have a structural challenge: pods are ephemeral. When a pod crashes and restarts, its logs disappear. You need a log aggregation layer that collects logs from all pods, across all nodes, and stores them centrally before they vanish.

Answers: "What exactly happened at 14:32:07 in the genomics namespace?"
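Since pods write to stdout and the aggregation layer does the parsing, logs are most useful when emitted as structured JSON, one object per line. A minimal stdlib-only sketch (the logger name and fields are illustrative, not from this article's codebase):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line,
    which log collectors like Promtail can parse into labels."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

# Log to stdout so the node-level collector picks the stream up
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("genomics.api")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("analysis job submitted")
```

Each line is now self-describing, so a query in the aggregation layer can filter on level or logger instead of grepping free text.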

Metrics — how much

Metrics are numeric measurements sampled over time: request rate, error rate, latency percentiles, CPU usage, memory consumption, active database connections. They are cheap to store, fast to query, and ideal for alerting and dashboards.

Metrics don't tell you why something is slow — but they tell you that it is slow and when it started.

Answers: "How many analysis requests failed in the last hour, and is that number growing?"
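Latency percentiles deserve a moment: a mean hides a single pathological request, while p99 exposes it. A quick stdlib illustration with made-up latencies:

```python
import statistics

# Simulated request latencies in seconds (hypothetical values):
# nine fast requests and one stalled outlier
latencies = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.18, 0.14, 3.9]

mean = statistics.mean(latencies)
p50 = statistics.median(latencies)
p99 = statistics.quantiles(latencies, n=100)[98]  # 99th percentile estimate

# The mean barely moves, the median not at all, but p99 screams —
# which is why dashboards and alerts use p95/p99, not averages.
print(f"mean={mean:.2f}s p50={p50:.2f}s p99={p99:.2f}s")
```

The same logic is what `histogram_quantile` computes later in this article, just server-side over Prometheus histogram buckets.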

Traces — the full journey

A trace follows a single request through every service it touches. In a microservices or multi-component architecture, a single API call might touch a Django view, a Celery task, a PostgreSQL query, and an S3 upload. A trace connects all of these into a single timeline, with durations for each step.

Traces are the pillar that answers the question logs and metrics can't: "the request was slow — but which part of it was slow?"

Answers: "The variant analysis request took 4.2 seconds — 3.8 seconds of which was a single database query in the pipeline."

Kubernetes events — what the cluster decided

Kubernetes emits events for everything it does: pod scheduled, pod evicted, image pull failed, node pressure detected, HPA scaled up. These events are not logs (they come from the control plane, not the application) and not metrics (they are discrete, not continuous). They are a fourth pillar that explains cluster-level behaviour.

Answers: "The pod restarted three times because the node ran out of memory — not because the application crashed."



The tooling landscape

Two paths exist for implementing this stack on Kubernetes.

Signal           | AWS-native                    | Open source
-----------------|-------------------------------|----------------------------
Logs             | CloudWatch Logs               | Loki + Promtail
Metrics          | CloudWatch Metrics            | Prometheus + node-exporter
Traces           | AWS X-Ray                     | Jaeger
Visualisation    | CloudWatch Dashboards         | Grafana
Instrumentation  | AWS Distro for OpenTelemetry  | OpenTelemetry
Alerting         | CloudWatch Alarms             | Prometheus Alertmanager

We are going with the open-source stack. These tools run identically on EKS, GKE, bare metal, or your laptop. They have no vendor lock-in, no per-metric pricing, and a massive community behind each of them. If the genomics platform ever moves off AWS, the observability layer moves with it unchanged.

How the pieces fit together

Django app
  └─ OTel SDK (auto-instrumentation)
       └─ OTel Collector
            ├─ metrics ──→ Prometheus ──→ Grafana
            ├─ logs ─────→ Loki ─────────→ Grafana
            └─ traces ───→ Jaeger ────────→ Grafana
                                    Alertmanager ──→ PagerDuty / Slack

A single OTel Collector acts as the central routing layer — your application sends all three signal types to one endpoint, and the collector fans them out to the right backends. Grafana sits on top of all three, giving you a unified view. This architecture means your application code only knows about one destination: the collector.

Deploying the stack with Helm

From Day 6, you already know how to use Helm. The entire observability stack is available as Helm charts:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

# Prometheus + Grafana + Alertmanager — all in one chart
helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace

# Loki + Promtail (log collection from all pods)
# (note: the loki-stack chart is deprecated upstream; newer setups use
# the separate grafana/loki and grafana/promtail charts instead)
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true

# Jaeger (distributed tracing backend)
helm install jaeger jaegertracing/jaeger \
  --namespace monitoring

# OpenTelemetry Collector (the routing layer)
helm install otel-collector \
  open-telemetry/opentelemetry-collector \
  --namespace monitoring \
  --values otel-collector-values.yaml



OpenTelemetry: the collection standard

OpenTelemetry (OTel) is a CNCF project that provides a single, vendor-neutral standard for producing and collecting observability data. Before OTel, you needed separate SDKs for Prometheus, Jaeger, and your logging library — each with different APIs and configuration. OTel unifies them: one SDK, one wire protocol, one collector.

It has two parts: the SDK (what you add to your application to produce telemetry) and the Collector (the agent that receives, processes, and exports it).

The OTel Collector: central routing

The collector is configured as a pipeline: receivers → processors → exporters. Here is the configuration for our stack (otel-collector-values.yaml):

config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

  processors:
    batch: {}
    memory_limiter:
      check_interval: 5s   # required: the processor refuses to start without it
      limit_mib: 512

  exporters:
    prometheus:
      endpoint: "0.0.0.0:8889"
    loki:
      endpoint: http://loki:3100/loki/api/v1/push
    # note: recent Collector releases removed the dedicated jaeger exporter;
    # on those versions, point an otlp exporter at Jaeger's OTLP port instead
    jaeger:
      endpoint: jaeger-collector:14250
      tls:
        insecure: true

  service:
    pipelines:
      metrics:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [prometheus]
      logs:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [loki]
      traces:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [jaeger]

Three pipelines, three backends — your application only talks to a single collector endpoint: port 4317 for gRPC (or 4318 for HTTP).

Instrumenting the Django API

The genomics API needs two things: a way to produce telemetry (the OTel SDK) and a way to expose metrics for Prometheus to scrape (the django-prometheus package).

Dependencies:

pip install \
  opentelemetry-sdk \
  opentelemetry-instrumentation-django \
  opentelemetry-instrumentation-psycopg2 \
  opentelemetry-exporter-otlp-proto-grpc \
  django-prometheus

Auto-instrumentation setup — the right place to call this is AppConfig.ready() in your app's apps.py, which runs exactly once on startup regardless of whether you're using manage.py, a WSGI server, or an ASGI server:

# yourapp/telemetry.py (any module imported at startup will do)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.django import DjangoInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor

def configure_tracing():
    provider = TracerProvider()
    exporter = OTLPSpanExporter(
        endpoint="http://otel-collector.monitoring:4317",
        insecure=True
    )
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    # Auto-instrument Django HTTP requests and PostgreSQL queries
    DjangoInstrumentor().instrument()
    Psycopg2Instrumentor().instrument()

# yourapp/apps.py
from django.apps import AppConfig

from .telemetry import configure_tracing  # or wherever you defined it above

class YourAppConfig(AppConfig):
    name = "yourapp"

    def ready(self):
        configure_tracing()

With this in place, every HTTP request and every database query is automatically traced — no manual instrumentation needed for the infrastructure layer.

Custom spans for business logic — the above covers HTTP and database calls automatically. For domain-specific visibility into your own code, add manual spans:

# analysis/views.py
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def run_variant_analysis(request, patient_id):
    with tracer.start_as_current_span("variant-analysis") as span:
        span.set_attribute("patient.id", patient_id)
        span.set_attribute("analysis.type", "variant-calling")

        result = VariantCallingPipeline.run(patient_id)

        span.set_attribute("analysis.variants_found", result.variant_count)
        span.set_attribute("analysis.duration_ms", result.duration_ms)
        return JsonResponse(result.to_dict())

Now when an analysis is slow, the trace shows exactly where the time went: Django view setup, PostgreSQL query, pipeline execution, response serialisation — each as a separate span with its own duration.

Exposing metrics for Prometheus

Add django-prometheus to INSTALLED_APPS and wire up the metrics endpoint:

# settings.py
INSTALLED_APPS = [
    ...
    'django_prometheus',
]

MIDDLEWARE = [
    'django_prometheus.middleware.PrometheusBeforeMiddleware',
    ...
    'django_prometheus.middleware.PrometheusAfterMiddleware',
]

# urls.py
from django.urls import path, include

urlpatterns = [
    path('', include('django_prometheus.urls')),  # exposes /metrics
    ...
]

Then tell Prometheus to scrape it with a ServiceMonitor (from the kube-prometheus-stack CRDs):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: genomics-api
  namespace: monitoring
  labels:
    # by default, kube-prometheus-stack only discovers ServiceMonitors
    # carrying its Helm release label; match your release name here
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: genomics-api
  namespaceSelector:
    matchNames:
      - genomics
  endpoints:
  - port: http
    path: /metrics
    interval: 30s

Prometheus now automatically discovers and scrapes the genomics API every 30 seconds. No manual configuration when pods are rescheduled.



SLI, SLO, SLA: from data to accountability

Collecting telemetry is pointless without defining what good looks like. SLI, SLO, and SLA are the framework for turning raw observability data into meaningful commitments.

SLI (Service Level Indicator) — a measurement. A specific, quantifiable signal that reflects the user experience:

"The percentage of analysis API requests that return in under 2 seconds."
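Assuming the django-prometheus latency histogram used later in this article, with a bucket boundary added at 2 s (the library's default buckets don't include 2.0, so it must be added via PROMETHEUS_LATENCY_BUCKETS in settings), that SLI could be computed with a PromQL expression along these lines:

```promql
# Fraction of analysis requests completing in under 2 s, over 5 minutes
sum(rate(django_http_requests_latency_seconds_by_view_method_bucket{
      job="genomics-api", view="run_variant_analysis", le="2.0"}[5m]))
/
sum(rate(django_http_requests_latency_seconds_by_view_method_count{
      job="genomics-api", view="run_variant_analysis"}[5m]))
```

The result is a ratio between 0 and 1; the SLO is simply "this ratio stays at or above 0.99 over the 30-day window".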

SLO (Service Level Objective) — the target you set for that indicator:

"99% of analysis requests should return in under 2 seconds, measured over a rolling 30-day window."

SLA (Service Level Agreement) — the formal commitment to external parties, with consequences:

"If availability drops below 99.5% in any calendar month, affected institutions receive a service credit."

The gap between your SLO (99%) and 100% is your error budget — the room you have to take risks, deploy changes, and absorb incidents without breaching your commitment. If you have a 1% error budget over 30 days, that's about 7 hours of degraded service you can afford. Spend it wisely.
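The arithmetic behind that "about 7 hours" figure takes only a few lines; the 5% error rate below is a made-up example to show the burn-rate view:

```python
# Error-budget arithmetic for a 99% SLO over a rolling 30-day window.
slo = 0.99
window_hours = 30 * 24              # 720 hours in the window

budget_fraction = 1 - slo           # 1% of the window may be "bad"
budget_hours = window_hours * budget_fraction
print(f"error budget: {budget_hours:.1f} hours")  # 7.2 hours

# Burn-rate view: a hypothetical 5% observed error rate consumes
# the 1% budget five times faster than is sustainable.
error_rate = 0.05
burn_rate = error_rate / budget_fraction
hours_to_exhaustion = window_hours / burn_rate
print(f"burn rate ~{burn_rate:.0f}x: budget gone in {hours_to_exhaustion:.0f} hours")
```

A burn rate of 1x means the budget lasts exactly the window; anything sustained above 1x is a countdown.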

Measuring the SLI in Prometheus

groups:
- name: genomics-api.sli
  rules:
  - record: job:genomics_api_p99_latency_seconds
    expr: |
      histogram_quantile(0.99,
        rate(django_http_requests_latency_seconds_by_view_method_bucket{
          job="genomics-api",
          view="run_variant_analysis"
        }[5m])
      )

Alerting when the error budget burns too fast

groups:
- name: genomics-api.slo
  rules:
  - alert: AnalysisAPIErrorBudgetBurn
    expr: |
      (
        rate(django_http_responses_total_by_status_view_method_total{
          job="genomics-api",
          status=~"5.."
        }[1h])
        /
        rate(django_http_responses_total_by_status_view_method_total{
          job="genomics-api"
        }[1h])
      ) > 0.01
    for: 5m
    labels:
      severity: warning
      team: genomics-platform
    annotations:
      summary: "Genomics API error rate above SLO threshold"
      description: >
        Error rate is {{ $value | humanizePercentage }}.
        At this rate, the monthly error budget will be exhausted in
        {{ printf "%.1f" (div 0.01 $value | mul 720) }} hours.

This alert doesn't just tell you that the error rate is high — it tells you how quickly you are burning through your error budget, so you can decide whether to roll back immediately or investigate first.

A good alerting rule alerts on user impact, not on infrastructure symptoms. "Error budget burning at 10x" is more actionable than "CPU at 78%".



Conclusion

A deployed system that you can't observe is a system you don't fully control. Logs tell you what happened. Metrics tell you how much and how often. Traces tell you where exactly things went wrong. Kubernetes events tell you what the cluster itself was doing. Together, they answer the question that every production incident eventually raises: "what on earth is going on in there?"

The open-source stack — Prometheus, Grafana, Loki, Jaeger, OpenTelemetry — gives you all of this without vendor lock-in. It runs on EKS today, on bare metal tomorrow, on a laptop for local debugging when you need it. The investment in learning these tools pays dividends regardless of which cloud you're on.

And SLI/SLO/SLA takes that data one step further: it turns raw signals into a shared language between the people who build the system and the people who depend on it. Error budgets make the trade-off between reliability and velocity explicit, instead of leaving it as a permanent source of tension.

The platform is now provisioned, secured, deployed, and observable. That's a complete stack — and the knowledge to rebuild it anywhere.



More on this topic

Official documentation:

Tools:

Video tutorials: