What Is OpenTelemetry, and How Do Traces, Spans, Metrics, and Logs Fit Together?

OpenTelemetry (OTel) is an open-source observability framework that standardizes how applications collect and export telemetry data, including logs, metrics, and traces across distributed systems.

Understanding OpenTelemetry and Its Importance

OpenTelemetry provides a unified set of APIs, SDKs, and tools for instrumenting code and gathering telemetry data from cloud-native applications.

By offering vendor-neutral instrumentation, OTel lets developers generate and send observability data without being tied to any single monitoring vendor or backend. This means teams can avoid vendor lock-in and integrate with multiple analysis platforms seamlessly.

As a Cloud Native Computing Foundation (CNCF) project (born from the merger of OpenTracing and OpenCensus in 2019), OpenTelemetry has quickly become the de facto standard for telemetry collection in modern microservices and distributed systems.

In essence, OTel’s importance lies in providing a consistent way to instrument and gather observability data across many languages and environments, filling visibility gaps in complex architectures and enabling easier debugging and performance tuning.

Observability refers to the ability to understand internal system state by examining its outputs (telemetry data).

OpenTelemetry focuses on three primary categories of telemetry signals, often called the “three pillars of observability”: traces, metrics, and logs.

Each pillar provides a different perspective on system behavior, as explained below, and together they give a comprehensive view of system health.

Traces and Spans (Distributed Tracing)

Traces represent the end-to-end journey of a single request or transaction as it propagates through a distributed system.

In other words, a trace captures how a request (for example, an API call or user action) travels across multiple services and components from start to finish.

Traces are composed of one or more spans, which are the fundamental units of work in a trace.

A span represents a single operation or step within that request’s workflow (such as a function call, database query, or an HTTP request to another service).

Spans have metadata like a name, start/end timestamps, and other attributes that detail what happened during that operation.

Importantly, spans are linked together in a parent-child hierarchy: the first span in a trace is the root span (representing the initial operation, e.g. an incoming client request), and child spans represent sub-operations or downstream service calls.

This hierarchy of spans forms a directed acyclic graph (DAG), or simply a tree, that depicts the entire workflow. By correlating all spans with a common trace ID, OpenTelemetry can reconstruct the full path of the request through the system.

For example, consider a user placing an order on an e-commerce site.

The action of submitting the order would generate a trace that includes spans for each microservice involved. One span might cover the checkout service handling the request, another span for the payment service processing the payment, and yet another for the inventory service updating stock.

All these spans share the same trace ID to indicate they belong to the same transaction, with the checkout span as the root and others as children.

This trace would show how the request flowed and where time was spent or errors occurred.
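
As a rough sketch of how this looks in code, the snippet below uses the OpenTelemetry Python SDK to build the order trace in a single process. The span names (checkout, process-payment, update-inventory) and the order attribute are illustrative; in a real system each child span would be created in its own service, with the trace context propagated between them.

```python
# Minimal single-process sketch using the OpenTelemetry Python SDK.
# Nesting the spans here just illustrates the parent-child hierarchy
# and the shared trace ID.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider that prints finished spans to stdout.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("checkout-demo")

# Root span: the incoming "place order" request.
with tracer.start_as_current_span("checkout") as root:
    root.set_attribute("order.id", "A-1001")  # illustrative attribute

    # Child span: the payment step.
    with tracer.start_as_current_span("process-payment"):
        pass  # call the payment service here

    # Sibling child span: the inventory update.
    with tracer.start_as_current_span("update-inventory"):
        pass  # call the inventory service here

# All three spans share one trace ID; "checkout" is the root span,
# and the other two record its sub-operations.
```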

Using OpenTelemetry’s tracing, developers gain a holistic view of how services interact to serve a request.

Distributed tracing makes it easier to pinpoint performance bottlenecks, latency sources, or failures in a complex microservice architecture.

Each span can also record contextual data (attributes and events), and errors in any span can be traced to identify which service or operation caused a problem.
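
Attributes, events, and error details are recorded directly on the span object. The sketch below assumes a tracer provider configured as in the previous example; charge_card is a hypothetical payment call standing in for real work.

```python
# Sketch: recording context and errors on a span. Assumes a tracer
# provider has already been configured; charge_card is a hypothetical
# payment call that fails for demonstration purposes.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("payment-demo")

def charge_card():
    raise RuntimeError("card declined")  # stand-in failure for the demo

with tracer.start_as_current_span("process-payment") as span:
    span.set_attribute("payment.amount_cents", 4999)  # contextual attribute
    span.add_event("card-validated")                  # point-in-time event
    try:
        charge_card()
    except RuntimeError as exc:
        span.record_exception(exc)                    # attaches the stack trace
        span.set_status(Status(StatusCode.ERROR))     # marks the span as failed
```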

In summary, traces tell you where and how a request moved through the system, and spans are the building blocks of that story, detailing each operation’s contribution.

Metrics

Metrics are numerical measurements captured over time, reflecting the performance or resource usage of a system.

They are essentially time-series data points that quantify aspects of system behavior.

Common examples of metrics include CPU utilization, memory consumption, request throughput, latency, error rates, and so on.

Metrics are typically collected at regular intervals and can be aggregated or averaged, making them useful for tracking trends and detecting anomalies.

For instance, a metric could show the number of requests per second your service handles or the average response time over the last 5 minutes.

In OpenTelemetry, metrics are captured via instrumented code (using counters, gauges, histograms, etc.) and can be exported to monitoring systems for analysis.
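
As a minimal sketch (the instrument names, attributes, and export interval are all illustrative), the OpenTelemetry Python SDK lets you create a counter and a histogram and export their values periodically:

```python
# Sketch: creating metric instruments with the OpenTelemetry Python SDK.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Export accumulated metrics to stdout every 10 seconds.
reader = PeriodicExportingMetricReader(
    ConsoleMetricExporter(), export_interval_millis=10_000
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("orders-demo")

# Counter: a monotonically increasing count of handled requests.
request_counter = meter.create_counter(
    "http.requests", unit="1", description="Requests served"
)
# Histogram: a distribution of request latencies.
latency_histogram = meter.create_histogram(
    "http.request.duration", unit="ms", description="Request latency"
)

# Record one request that took 42 ms.
request_counter.add(1, {"http.route": "/orders"})
latency_histogram.record(42, {"http.route": "/orders"})
```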

Metrics help answer the question “what is happening?” in terms of system load or performance.

They are excellent for watching rates and ratios (such as traffic levels or error percentages) and are often used to set up alerts (e.g., triggering an alert if CPU usage exceeds 90% or if the error rate doubles).

Because metrics are quantitative and structured, they can be efficiently stored and queried over long periods to analyze historical trends.

However, metrics usually don’t explain why a problem occurred. That’s where traces and logs come in.

OpenTelemetry’s metrics component allows you to collect these vital stats uniformly across services.

For example, you might measure the database query duration or memory usage of a service using OTel’s Metrics API, and then send those metrics to a backend like Prometheus.
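A hedged sketch of that setup, assuming the opentelemetry-exporter-prometheus and prometheus-client packages are installed (the port, meter name, and instrument name are illustrative):

```python
# Sketch: exposing OTel metrics on an endpoint for Prometheus to scrape,
# instead of printing them to the console.
from prometheus_client import start_http_server
from opentelemetry import metrics
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.metrics import MeterProvider

start_http_server(port=9464)  # Prometheus scrapes this HTTP endpoint
metrics.set_meter_provider(
    MeterProvider(metric_readers=[PrometheusMetricReader()])
)
meter = metrics.get_meter("db-demo")
query_duration = meter.create_histogram(
    "db.query.duration", unit="ms", description="Database query time"
)
query_duration.record(17, {"db.operation": "SELECT"})
```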

By examining metrics, you might spot that a service’s latency spiked at a certain time or that memory usage steadily increased (potential memory leak).

In short, metrics tell you what changed in a system’s performance. They quantify the issue and indicate when/where to look, but they may not directly reveal the root cause.

Logs

Logs are a chronological record of discrete events that happen within a system. A log entry is typically a timestamped message, optionally with a structured payload, detailing some event or error.

Almost every application generates logs. For example, an error log might record an exception stack trace, or an info log might record that “User X logged in at 12:00 UTC.” Logs can be plain text or structured (e.g. JSON) and often include metadata like severity level, service name, thread ID, etc. They provide the most granular insight into what the application was doing at a specific moment.

In observability, logs are invaluable for diagnosing the precise cause of issues.

They answer the “why did this happen?” by capturing detailed information about errors, state changes, or any notable events. For instance, if a trace shows a request failed in the payment service, the logs from that service (correlated by the same trace or request ID) might reveal an exception like “DatabaseTimeout: connection pool exhausted”. This level of detail helps engineers root-cause problems that metrics and traces only hint at.

OpenTelemetry supports log collection as well (though logs in OTel became stable later than traces/metrics). It can attach trace context to logs, so that each log can be associated with the trace of the request that triggered it.
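
A simple way to see this mechanism is to read the current span’s context and stamp its IDs onto ordinary log lines. OpenTelemetry’s logging instrumentation can inject these fields automatically; the manual sketch below (with an illustrative logger name and message) just makes the idea concrete:

```python
# Sketch: stamping standard-library log lines with the active trace and
# span IDs so they can be joined against trace data later.
import logging

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())  # minimal SDK setup
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("payment-service")

def log_with_trace(message):
    ctx = trace.get_current_span().get_span_context()
    # trace_id is a 128-bit int and span_id a 64-bit int; render as hex.
    logger.info(
        "%s trace_id=%s span_id=%s",
        message,
        format(ctx.trace_id, "032x"),
        format(ctx.span_id, "016x"),
    )

with trace.get_tracer("payment-demo").start_as_current_span("charge"):
    log_with_trace("DatabaseTimeout: connection pool exhausted")
```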

This contextual logging means that when you find a relevant log entry, you can easily pivot to the related trace, and vice versa. Logs are often high volume, so storing all of them can be expensive; teams may use sampling or retention policies to manage cost.

Nevertheless, when an incident occurs, logs provide the concrete evidence of what exactly went wrong (e.g., an exception message, a validation failure, etc.), complementing the high-level view from metrics and traces.


How These Signals Fit Together for Observability

In a modern observability strategy, traces, metrics, and logs work in tandem to give a full picture of system behavior.

Each type of data excels at different tasks: metrics quantify the scope of an issue (e.g., a spike in error rate or latency), traces reveal the path and location of the issue across services, and logs provide details and context to explain the issue.

When combined and correlated, these three pillars provide a holistic view that is far more powerful than any one in isolation.

OpenTelemetry’s key advantage is that it unifies the instrumentation of all three signals, making it easier to correlate them.

For example, OTel can propagate a trace ID across services, and if that trace ID is attached to logs and metrics, you can quickly jump from an alerting metric to related traces, and then to the exact error log.
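
Under the hood, propagation means injecting the current trace context into outgoing request headers and extracting it on the receiving side. In the sketch below a plain dict stands in for HTTP headers, and both “services” run in one process purely for illustration:

```python
# Sketch: propagating trace context between services. A plain dict stands
# in for HTTP headers; real code would inject into the outgoing request
# and extract from the incoming one.
from opentelemetry import trace
from opentelemetry.propagate import extract, inject
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())  # minimal SDK setup
tracer = trace.get_tracer("propagation-demo")

# "Service A": start a span and inject its context into the headers.
headers = {}
with tracer.start_as_current_span("service-a-request"):
    inject(headers)  # adds a W3C `traceparent` header carrying the trace ID

# "Service B": extract the context and continue the same trace.
ctx = extract(headers)
with tracer.start_as_current_span("service-b-handler", context=ctx) as span:
    # This span shares Service A's trace ID.
    print(format(span.get_span_context().trace_id, "032x"))
```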

To illustrate, imagine your monitoring dashboard shows an increase in page load time (metric) for a web application.

Using OpenTelemetry, you can trace one of these slow requests. The trace might show that the request spent most of its time in a particular microservice. Drilling down, you find a span for a database call that took unusually long.

Now, by checking logs from that timeframe (with the same trace or span ID), you discover an error log indicating a database deadlock at that moment.

In this way, each signal leads to the next: metrics flagged the issue, tracing identified where it occurred, and logs revealed the why.

OpenTelemetry enables this seamless cross-analysis by correlating telemetry data.

It ensures that traces, metrics, and logs aren’t siloed data streams but interconnected signals. This unified approach improves incident response and debugging: teams can detect anomalies via metrics, use tracing to pinpoint the affected components and follow the request path, and then consult logs to find the exact error or unusual event.

The result is faster root-cause analysis and more robust performance tuning.

By leveraging all three pillars through OpenTelemetry, even beginners and junior engineers can systematically approach problems in distributed systems, much like piecing together a story from different angles.

Conclusion

In summary, OpenTelemetry provides a standardized way to collect and unify traces, metrics, and logs, which are the core telemetry signals needed for observability.

A trace is the storyline of a request through various services, built from spans that detail each step.

Metrics are the numbers that monitor the system’s pulse over time, and logs are the detailed diary entries that explain events.

Together, these data sources fit together to give a 360° view of system health and behavior, empowering developers to monitor and troubleshoot distributed applications effectively.
