Distributed Tracing with OpenTelemetry
A request hits your API gateway, fans out to three microservices, queries two databases, and returns in 800ms. One of those services is slow. Logs tell you each service received and returned a request. Metrics tell you the p99 latency per service. But neither tells you which path this specific slow request took, or how long each hop contributed. Distributed tracing does.
Why Tracing
Tracing answers questions that logs and metrics can't:
- Which service is responsible for the tail latency on this request?
- What was the full call graph for request ID abc123?
- Did the slowness come from the database query or the network hop?
- Which downstream dependency is degraded right now?
Traces, Spans, Context
A trace represents a single end-to-end request. It is composed of spans — individual units of work (an HTTP handler, a DB query, a cache lookup). Spans form a tree: the root span is the entry point; child spans are downstream calls.
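To make the tree concrete, here is a minimal sketch that rebuilds a trace from a flat list of spans (roughly how a backend receives them) and then follows the slowest child at each level. The `Span` class, its field names, and the `build_tree` and `slowest_path` helpers are hypothetical simplifications. Following the slowest child is only a rough critical-path heuristic, because it ignores overlap between children that run in parallel.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    # Hypothetical, simplified span record (real spans carry many more fields).
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: float
    end_ms: float
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def build_tree(spans: List[Span]) -> Span:
    """Attach each span to its parent and return the root span."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def slowest_path(span: Span) -> List[str]:
    """Follow the longest-running child at each level (rough critical path)."""
    path = [span.name]
    while span.children:
        span = max(span.children, key=lambda c: c.duration_ms)
        path.append(span.name)
    return path


trace_spans = [
    Span("a", None, "GET /orders", 0, 800),
    Span("b", "a", "inventory-svc", 10, 150),
    Span("c", "a", "payment-svc", 160, 760),
    Span("d", "c", "SELECT payments", 170, 700),
]
print(slowest_path(build_tree(trace_spans)))
# ['GET /orders', 'payment-svc', 'SELECT payments']
```

In this made-up trace the 800ms request spends most of its time in payment-svc, and most of that inside a single database query, which is exactly the per-hop breakdown that logs and metrics alone cannot give you.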
Trace context is propagated between services in HTTP headers; the W3C Trace Context traceparent header is the standard format. Every service that receives a request extracts the trace ID and the caller's span ID, creates a child span, and forwards the updated context to its own downstream calls.
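A simplified sketch of what each service does with the traceparent header: parse the incoming value, then build the header for a downstream call that keeps the trace ID and sampled flag but names the current span as the new parent. It handles only version 00 of the format and ignores the companion tracestate header; the helper names are hypothetical, and in practice the OTel SDK's W3C propagator does this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version 00 format: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
_TRACEPARENT_V00 = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the incoming context, or None if the header is invalid."""
    m = _TRACEPARENT_V00.match(header.strip())
    if m is None:
        return None
    trace_id, span_id, flags = m.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:  # all-zero IDs are invalid
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext, span_id: Optional[str] = None) -> str:
    """Header for a downstream call: same trace ID, this service's span as parent."""
    span_id = span_id or secrets.token_hex(8)  # 8 random bytes = 16 hex chars
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
print(ctx)
print(child_traceparent(ctx))  # new span ID, same trace ID and sampled flag
```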
OTel Stack Overview
OpenTelemetry (OTel) is the CNCF standard for telemetry instrumentation. It unifies traces, metrics, and logs under one API so you don't depend on vendor SDKs.
| Component | Role |
|---|---|
| OTel SDK | Library added to the application. Creates spans and exports telemetry. |
| OTel Operator | Kubernetes operator that injects auto-instrumentation into pods through a mutating admission webhook, so no code changes are needed. |
| OTel Collector | Receives, processes, and exports telemetry. Can fan out to multiple backends. |
| Backend (Jaeger/Tempo) | Stores and queries traces. Jaeger is self-contained; Tempo integrates with Grafana. |
Auto-Instrumentation
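The OTel Operator can inject a language's auto-instrumentation agent through an init container and configure it with environment variables, so many applications need no code changes at all. Supported languages include Java, Python, Node.js, and .NET; Go support relies on eBPF and is still experimental.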
```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: myapp-instrumentation
  namespace: production
spec:
  exporter:
    # gRPC OTLP. The Python agent exports OTLP/HTTP by default and expects port 4318.
    endpoint: http://otel-collector.monitoring:4317
  propagators:
    - tracecontext  # W3C standard
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"  # sample 10% of new traces; child spans follow their parent's decision
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-svc
  namespace: production
spec:
  # selector, replicas, and container spec omitted for brevity
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "myapp-instrumentation"
        # For Python: instrumentation.opentelemetry.io/inject-python: "myapp-instrumentation"
        # For Node.js: instrumentation.opentelemetry.io/inject-nodejs: "myapp-instrumentation"
```
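The "configure it via environment variables" step can be sketched as a mapping from the Instrumentation spec to the standard OTel SDK environment variables. The helper below is hypothetical and only approximates what the operator injects; the operator also sets other variables (service name, resource attributes, and language-specific agent settings), which are left out here.

```python
from typing import Dict


def sdk_env_from_instrumentation(spec: Dict) -> Dict[str, str]:
    """Hypothetical helper: map an Instrumentation spec to standard OTel SDK
    environment variables (a subset of what the operator actually sets)."""
    env: Dict[str, str] = {}
    endpoint = spec.get("exporter", {}).get("endpoint")
    if endpoint:
        env["OTEL_EXPORTER_OTLP_ENDPOINT"] = endpoint
    if spec.get("propagators"):
        env["OTEL_PROPAGATORS"] = ",".join(spec["propagators"])
    sampler = spec.get("sampler", {})
    if "type" in sampler:
        env["OTEL_TRACES_SAMPLER"] = sampler["type"]
    if "argument" in sampler:
        env["OTEL_TRACES_SAMPLER_ARG"] = str(sampler["argument"])
    return env


spec = {
    "exporter": {"endpoint": "http://otel-collector.monitoring:4317"},
    "propagators": ["tracecontext", "baggage"],
    "sampler": {"type": "parentbased_traceidratio", "argument": "0.1"},
}
for name, value in sdk_env_from_instrumentation(spec).items():
    print(f"{name}={value}")
```

Because these are the SDKs' standard variables, setting them by hand on a VM or in a plain container gives a broadly similar configuration without the operator.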
OTel Collector
The Collector acts as a telemetry hub. Pipelines are composed of receivers (how data comes in), processors (transformations), and exporters (where data goes).
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s  # required: how often memory usage is checked
    limit_mib: 512
    spike_limit_mib: 128
  batch:
    timeout: 5s
    send_batch_size: 512
  resource:
    attributes:
      - action: insert
        key: k8s.cluster.name
        value: production

exporters:
  otlp/tempo:
    endpoint: tempo.monitoring:4317
    tls:
      insecure: true
  # Jaeger ingests OTLP natively; the old `jaeger` exporter was removed from the Collector.
  otlp/jaeger:
    endpoint: jaeger-collector.monitoring:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]  # memory_limiter first, batch last
      exporters: [otlp/tempo, otlp/jaeger]  # fan out to both backends
```
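The batch settings above (timeout: 5s, send_batch_size: 512) combine two flush triggers. The class below is a simplified model of that rule, not the Collector's code: a batch is sent as soon as send_batch_size items are buffered, and otherwise whatever is buffered goes out when the timeout timer fires, with the timer restarting after every send. The class name and the injectable clock are assumptions made so the behavior can be shown deterministically.

```python
import time
from typing import Callable, List


class Batcher:
    """Simplified model of the batch processor's flush rules: send as soon as
    send_batch_size items are buffered, and otherwise send whatever is buffered
    each time the timeout timer fires (the timer restarts after every send)."""

    def __init__(self, send: Callable[[List], None], send_batch_size: int = 512,
                 timeout_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.send = send
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self.clock = clock
        self.items: List = []
        self.deadline = clock() + timeout_s

    def add(self, item) -> None:
        self.items.append(item)
        if len(self.items) >= self.send_batch_size:
            self._send()

    def tick(self) -> None:
        """Poll the timer; a real implementation would use a background timer."""
        if self.clock() >= self.deadline:
            self._send()

    def _send(self) -> None:
        if self.items:
            self.send(self.items)
            self.items = []
        self.deadline = self.clock() + self.timeout_s


# Usage with a fake clock so the timing is deterministic.
now = [0.0]
sent: List[List[int]] = []
b = Batcher(sent.append, send_batch_size=3, timeout_s=5.0, clock=lambda: now[0])
for span in range(4):
    b.add(span)  # the third add triggers a size-based send
now[0] = 5.0
b.tick()  # timer fires: the leftover partial batch is sent
print(sent)  # [[0, 1, 2], [3]]
```

The size trigger keeps throughput high under load, while the timer bounds how long a span can sit in the Collector when traffic is light.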
Backends — Jaeger & Tempo
| | Jaeger | Grafana Tempo |
|---|---|---|
| UI | Built-in Jaeger UI | Grafana (same dashboards as metrics/logs) |
| Storage | Elasticsearch, Cassandra, Badger | Object storage (S3, GCS), which is very cheap |
| Best for | Standalone tracing, existing Elasticsearch | Unified Grafana stack (logs + metrics + traces) |
| Querying | Search by service, operation, tags, and duration | TraceQL, similar to PromQL/LogQL |
Sampling Strategies
Tracing every request is expensive. Sampling controls what fraction of traces are recorded.
| Strategy | How it works | Tradeoff |
|---|---|---|
| Head-based (ratio) | Decision made at the root span. Sample 10% of all requests. | Simple. May miss rare errors if not sampled. |
| Tail-based | Decision made after trace completes. Always record slow or errored traces. | More useful signal. Requires buffering full traces in collector. |
| Always-on | Record everything. | Full fidelity. Only viable at low request volume. |
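Head-based ratio sampling can be decided independently in every service without any coordination. The sketch below mirrors the approach of the OpenTelemetry Python SDK's TraceIdRatioBased sampler (a root span is kept when the lower 64 bits of its trace ID fall below rate × 2^64) and adds the parent-based rule used by the parentbased_traceidratio sampler in the Instrumentation example. The function names are hypothetical, and other SDKs may differ in details.

```python
import random
from typing import Optional

TRACE_ID_LIMIT = (1 << 64) - 1  # only the lower 64 bits of the trace ID are used


def ratio_bound(rate: float) -> int:
    return round(rate * (TRACE_ID_LIMIT + 1))


def head_sample(trace_id: int, rate: float,
                parent_sampled: Optional[bool] = None) -> bool:
    """Simplified parentbased_traceidratio: a span with a parent inherits the
    parent's decision; a root span is sampled if its trace ID falls below the
    ratio bound, so every service computes the same answer for the same trace."""
    if parent_sampled is not None:
        return parent_sampled
    return (trace_id & TRACE_ID_LIMIT) < ratio_bound(rate)


rng = random.Random(42)
ids = [rng.getrandbits(128) for _ in range(10_000)]
kept = sum(head_sample(t, 0.1) for t in ids)
print(f"kept {kept} of {len(ids)} root spans")  # close to 1,000
```

Because the decision depends only on the trace ID (or the parent's sampled flag), a trace is either kept in full or dropped in full. That is also why head sampling cannot favor errors: the decision is made before anyone knows the request will fail.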
```yaml
processors:
  tail_sampling:  # add to the traces pipeline's processors, before batch
    decision_wait: 10s  # buffer spans this long after a trace's first span before deciding
    policies:  # a trace is kept if ANY policy matches
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 500}
      - name: probabilistic-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 5}  # also keep ~5% of all other traces
```
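The decision those policies make on a fully buffered trace can be sketched as below. This is a simplified model, not the processor's implementation: the `FinishedSpan` record and `keep_trace` function are hypothetical, trace latency is taken as earliest start to latest end, and the hash used for the probabilistic policy is arbitrary rather than the one the processor uses. Buffering also implies that all spans of a trace must reach the same Collector instance; with several replicas that usually means routing by trace ID (for example with the load-balancing exporter).

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class FinishedSpan:
    start_ms: float
    end_ms: float
    is_error: bool


def keep_trace(trace_id: str, spans: List[FinishedSpan],
               threshold_ms: float = 500, sampling_percentage: float = 5) -> bool:
    """Evaluate a fully buffered trace; it is kept if ANY policy matches."""
    # errors-policy: any span with an ERROR status
    if any(s.is_error for s in spans):
        return True
    # slow-traces: whole-trace duration, earliest start to latest end
    duration = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration >= threshold_ms:
        return True
    # probabilistic-sample: hash the trace ID so the decision is stable per trace
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < sampling_percentage * 100


ok_fast = [FinishedSpan(0, 120, False), FinishedSpan(10, 90, False)]
failed = [FinishedSpan(0, 80, False), FinishedSpan(5, 40, True)]
slow = [FinishedSpan(0, 650, False)]
print(keep_trace("trace-1", failed), keep_trace("trace-2", slow))  # True True
```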
Log–Trace Correlation
The real power is correlation: click a slow trace span, jump to the exact log lines emitted during that span. This requires the application to include trace_id and span_id in every log line.
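```python
from opentelemetry import trace
import structlog


def add_trace_context(logger, method_name, event_dict):
    """structlog processor: attach the active span's IDs to every log event."""
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:  # no active span: leave the event untouched
        # Hex encoding used by W3C/OTLP: 32 chars for trace IDs, 16 for span IDs
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict


structlog.configure(
    processors=[add_trace_context, structlog.processors.JSONRenderer()]
)
```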
In Grafana, add a derived field to the Loki data source that extracts trace_id from each log line with a regex and links to the Tempo data source. Matching log lines then get a clickable link that opens the trace in Tempo.
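A derived field's value comes from its regex's capture group, so the regex has to match how the logger actually serializes the ID. The check below uses a hypothetical regex and a sample event: structlog's JSONRenderer uses json.dumps by default, which writes a space after each colon, so the pattern allows optional whitespace there.

```python
import json
import re

# Hypothetical derived-field regex; the single capture group is the trace ID.
TRACE_ID_RE = re.compile(r'"trace_id":\s*"([0-9a-f]{32})"')

line = json.dumps({
    "event": "payment declined",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
})
match = TRACE_ID_RE.search(line)
print(match.group(1) if match else "no trace_id")
```

A stricter pattern without the `\s*` would silently match nothing against this output, and the trace link would never appear.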