The biggest compliment you can pay a technology standard is to say that it is “boring.”
Transport layer standards like TCP and UDP are boring. They are the underlying protocols powering pretty much everything we do on the internet. REST is boring. It brokers trillions of API transactions across the world’s distributed microservices.
Infrastructure standards get boring when they become common infrastructure plumbing that “just works.” They give open source creators and maintainers common foundations to innovate on top of. They give users confidence that their technology choices aren’t leading to dead ends. Look at any major evolution in computing, and it’s always propped up by boring standards.
When OpenTelemetry debuted at KubeCon in 2019, telemetry data for applications in distributed systems was far from boring. Two competing projects—OpenTracing and OpenCensus—were trying to encapsulate best practices around the still nascent distributed tracing discipline. They were “very similar in that they aim[ed] to unify app instrumentation and make observability a built-in feature in every modern application” (Merging OpenTracing and OpenCensus: Goals and Non-Goals). But having two fragmented communities around these two projects made it hard for distributed tracing to move forward—so it was a big deal when the two projects merged and OpenTelemetry was born.
Let’s take a look at how the OpenTelemetry standard is turning telemetry data into that kind of boring, dependable infrastructure.
Standard Semantic Conventions and a Single ‘Mental Model’
Semantic conventions standardize how we describe common operations so that similar actions produce consistent telemetry. Developers instrumenting applications rely on clear semantic definitions to understand both how to instrument everything from database drivers to SQL statements and how to consume the resulting telemetry data for analysis.
For example, OpenTelemetry semantic conventions define where in the metadata the process name should be stored, but not what the process name should be. Similarly, if I’m deploying my application on Kubernetes on a cloud provider somewhere, I’d expect my telemetry data to have the Kubernetes pod and namespace it originates from, as well as the cluster and region.
OpenTelemetry also standardizes attribute names across languages. So if I have an HTTP service handling requests, it doesn’t matter whether it is written in Java, Python or any other language: with OpenTelemetry, I can start querying the data without having to think about the language behind it.
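To make the cross-language point concrete, here is a minimal sketch, in plain Python dicts rather than the OpenTelemetry SDK, where spans from two different runtimes carry the same semantic-convention attribute keys, so a single query covers both. The `duration_ms` field is an invented illustration; `http.request.method`, `http.response.status_code` and `telemetry.sdk.language` are real convention names.

```python
# Spans modeled as plain dicts (illustrative sketch, not the OTel SDK).
# Both spans use the same semantic-convention attribute keys, even though
# they come from services written in different languages.
spans = [
    {"telemetry.sdk.language": "java",
     "http.request.method": "GET",
     "http.response.status_code": 200,
     "duration_ms": 12},     # duration_ms is an invented field for this sketch
    {"telemetry.sdk.language": "python",
     "http.request.method": "GET",
     "http.response.status_code": 500,
     "duration_ms": 48},
]

def failed_http_requests(spans):
    # One query works across every language because the keys are identical.
    return [s for s in spans if s["http.response.status_code"] >= 500]

failures = failed_http_requests(spans)
```

The query never inspects `telemetry.sdk.language`; the shared attribute names make the language of the instrumented service irrelevant to analysis.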
OpenTelemetry also brings a consistent mental model to how the different telemetry data types (traces, metrics, logs) are related: they share the same base, so understanding how a span is modeled carries over to understanding metrics and logs. The same conventions apply to any microservice and to any HTTP- or gRPC-connected service or device, bringing consistency to how telemetry is reported for any type of event. As a result, any engineer working with OpenTelemetry can compare latencies and queries between services, even ones written in different languages, because the semantic conventions are the same.
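That shared base can be sketched the same way: in the OpenTelemetry data model, spans and metric data points both hang off a resource, so correlating signals reduces to matching resource attributes. The values below are invented; `service.name` and the `k8s.*` keys are real resource conventions, and `http.server.request.duration` is a real metric convention name.

```python
# Illustrative sketch of the shared data model (not the OTel SDK):
# one resource describes where the telemetry comes from, and every
# signal (span, metric, log) attaches to it.
resource = {
    "service.name": "checkout",        # semantic-convention resource keys;
    "k8s.pod.name": "checkout-7d9f",   # the values here are made up
    "k8s.namespace.name": "shop",
}

span = {"resource": resource, "name": "GET /cart", "duration_ms": 31}
metric_point = {"resource": resource,
                "name": "http.server.request.duration",
                "value_s": 0.031}

def same_origin(a, b):
    # Correlating a trace with a metric is a resource-attribute match.
    return a["resource"]["service.name"] == b["resource"]["service.name"]
```

Because both signals carry the same resource, a query that starts from a slow span can pivot to the matching service metrics without any per-signal mapping logic.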
Protecting Users From Vendor Lock-In
A decade ago, monitoring and logging vendors provided proprietary agents to their customers. (It wasn’t called observability back then.) The message to the market was always the same: if you want the best telemetry data, you need to use our agent. And every vendor had its own agent.
Adding these agents was risky and time-consuming. Once a vendor got its agents into a customer’s environment, they were hard to change: you had to pay either the vendor or your own team of engineers to rip out that instrumentation in favor of another vendor’s.
OpenTelemetry commoditized telemetry collection to the point that vendor agents are no longer necessary. Users can decide which vendor to use for observability tooling and keep the flexibility to switch vendors down the line. With OpenTelemetry, they no longer have to re-instrument applications, restart them and run risky operational cycles just to change the destination of their telemetry data.
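In practice, changing backends often comes down to exporter configuration rather than code. A minimal sketch using the standard OTLP exporter environment variables from the OpenTelemetry specification; the endpoints below are placeholders, not real vendor URLs.

```shell
# The application and its instrumentation stay untouched; only the
# export configuration changes. Endpoints here are placeholders.
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp.current-vendor.example:4317"

# Later, moving to a different observability backend is one line:
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp.new-vendor.example:4317"
```

No redeploy of instrumentation logic, no agent swap: the same OTLP stream simply flows to a different destination.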
Economies of Community Scale for Drivers and Integrations
OpenTelemetry has achieved a very large and diverse set of contributors in a relatively short period of time. It’s become the second most popular CNCF project, behind only Kubernetes.
The project is drawing contributions on both the instrumentation and the collection sides. Specifications are another big area of evolution, including the API/SDK spec, the data model and the OpenTelemetry protocol (OTLP). Accelerating all of this is broad participation from the logging, monitoring and observability vendor ecosystem, which supplies many of the project’s largest contributors.
We still have a ways to go. But in the last year, it’s become obvious that OpenTelemetry is so commonly regarded as the standard for tracing data that the language, library and database creators and maintainers are instrumenting for OpenTelemetry. This is the ultimate sign of maturity for any project or standard—when its popularity has created pull from maintainers designing around the spec and using it natively.
What’s Next For OpenTelemetry?
Looking to the future, I think we’re going to see an unintended consequence where OpenTelemetry moves the industry from operational data to more data intelligence. OpenTelemetry makes it much easier to instrument how users are interacting with systems and enables developers to add business metrics and compare those against actual user behavior.
OpenTelemetry also pairs naturally with eBPF and related projects like Cilium. These abstractions, which make the Linux kernel “programmable,” open up a pool of much finer-grained telemetry data that can be extracted from applications at runtime.
This all feeds into the concept of “Profiles” that OpenTelemetry is implementing. While distributed tracing emphasizes the horizontal cut of how a specific transaction runs, tying together all the services touched by a single request, a profile is a vertical cut that emphasizes the state of resources like memory and CPU at a point in time. Profiles are collected and stored continuously, and they are another great slice of telemetry data that will be encapsulated by the OpenTelemetry standard.