Building an Open Source Observability Platform

Enterprises are under constant pressure to keep their IT infrastructure and applications available throughout the year. The complexity of modern architectures (containers, hybrid cloud, SOA, microservices, etc.) is constantly growing, generating volumes of logs too large to manage by hand. We need intelligent application performance management (APM) and observability tools to achieve production excellence and meet availability and uptime goals. These tools analyze application health, performance and user experience. Adopting machine learning techniques to identify anomalies and behavior patterns helps detect root causes early and meet customer service level agreements (SLAs).

The APM and observability tools market is undoubtedly hot. These tools ingest multiple telemetry data feeds and act as analytics platforms that provide critical insights into application and infrastructure health, including system performance. Software development teams that adopt observability are much better equipped to release their application code iteratively. According to MarketsandMarkets research, the market for observability tools and platforms is expected to grow from $2.4 billion in 2023 to more than $4 billion by 2028, a compound annual growth rate (CAGR) of 11.7%.
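
As a quick sanity check on the growth figures quoted above, the compound growth formula (end = start x (1 + rate)^years) can be applied to the 2023 starting point. The snippet below is illustrative arithmetic only; the function name is made up for this sketch and the inputs are the numbers cited from the research.

```python
def project_market_size(start_billion: float, cagr: float, years: int) -> float:
    """Compound a starting value forward at a fixed annual growth rate."""
    return start_billion * (1 + cagr) ** years


# Inputs taken from the figures quoted in the text: $2.4B in 2023, 11.7% CAGR.
projected_2028 = project_market_size(2.4, 0.117, 2028 - 2023)
print(f"Projected 2028 market size: ${projected_2028:.2f} billion")
```

Five years of 11.7% growth on $2.4 billion works out to roughly $4.2 billion, which is consistent with the forecast.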

What is Observability?

Observability is the ability to collect data about distributed applications, their infrastructure and the communication among their internal and external components and services, so that teams can debug their systems thoroughly. It enables site reliability engineering (SRE), software engineering and operations teams to analyze customer impact and triage a service outage. Observability and monitoring are sometimes used interchangeably, but they are not the same. Monitoring (reactive) is the task of collecting and displaying data to determine the system’s overall state. Observability (proactive) makes that data accessible and lets you ask arbitrary questions of the system to understand more deeply how the code behaves.
Observability can be further broken down into three key pillars: logs, traces and metrics, which are essential for SRE observability.

• Logs help us diagnose the issues and tell us why they happened.
• Traces help us isolate issues and tell us where they happened.
• Metrics help us detect the issues and tell us what happened.
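
The way the three pillars complement each other during an incident can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the API of any real observability tool: the `Span` and `LogRecord` classes, the helper functions and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Span:
    trace_id: str
    service: str
    duration_ms: float
    error: bool


@dataclass
class LogRecord:
    trace_id: str
    service: str
    message: str


def error_rate(spans: List[Span]) -> float:
    """Metric: WHAT happened (fraction of operations that failed)."""
    return sum(s.error for s in spans) / len(spans) if spans else 0.0


def failing_services(spans: List[Span]) -> Dict[str, List[str]]:
    """Trace: WHERE it happened (failing service -> affected trace IDs)."""
    result: Dict[str, List[str]] = {}
    for s in spans:
        if s.error:
            result.setdefault(s.service, []).append(s.trace_id)
    return result


def logs_for_trace(logs: List[LogRecord], trace_id: str) -> List[str]:
    """Log: WHY it happened (messages recorded under one trace ID)."""
    return [rec.message for rec in logs if rec.trace_id == trace_id]


# Hypothetical sample telemetry for two services.
spans = [
    Span("t1", "checkout", 120.0, False),
    Span("t2", "payment", 950.0, True),
    Span("t3", "checkout", 110.0, False),
    Span("t4", "payment", 30.0, False),
]
logs = [
    LogRecord("t2", "payment", "upstream gateway timed out after 900 ms"),
    LogRecord("t1", "checkout", "order created"),
]

rate = error_rate(spans)                          # detect
where = failing_services(spans)                   # isolate
why = logs_for_trace(logs, where["payment"][0])   # diagnose
print(f"error rate: {rate:.0%}")
print(f"failing services: {where}")
print(f"log lines: {why}")
```

The shared trace ID is what ties the signals together here: the metric raises the alarm, the trace narrows it to one service, and the log lines for that trace explain the failure.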

Market Tools, Capabilities and Challenges

Gartner’s Magic Quadrant for APM and Observability identifies 20+ vendor products offering APM and observability capabilities through self-hosted, vendor-managed or SaaS deployments. These products provide features such as application performance metrics, event monitoring and alerting, traceability, anomaly detection and vulnerability detection.

An enterprise business application typically comprises homegrown applications (built with technologies such as .NET, Java, Python, SQL and NoSQL databases), third-party/off-the-shelf products (such as Salesforce, HubSpot, etc.) and integrations (such as Stripe, PayPal, etc.). Homegrown applications are hosted in an on-premises data center or with cloud vendors like AWS, GCP or Azure. The off-the-shelf products are SaaS-based or integrated via APIs. The result is a highly distributed application spanning tens to hundreds of nodes, services and instances, which makes observability challenging in several ways:

• Too Many Tools: Enterprises use various tools for monitoring application health and performance (such as New Relic, Datadog, etc.), error logging (such as Splunk) and cloud vendor-provided monitoring (such as CloudWatch). These products overlap in functionality, and procuring, learning and maintaining all of them can be cumbersome.

• Unpredictable Data Volume: Imagine the volume of observability data (logs, traces, metrics) collected based on the application traffic, usage, dependency on external products, etc. The amount of data storage required to consolidate these data feeds can quickly grow out of control.

• Pricing is Complex: Vendors offer different pricing models, such as per-host (such as Splunk, Datadog, Dynatrace), per-user (such as New Relic) and per-ingestion (such as Sumo Logic, AppDynamics). This variety makes it challenging to compare total cost of ownership (TCO) among vendors and determine the right tool for your requirements and budget.
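
The data volume and pricing challenges above can be made concrete with some back-of-the-envelope arithmetic. The sketch below estimates monthly telemetry volume from traffic and then compares the same footprint under three pricing models. Every number in it (traffic, bytes per request, host and user counts, and all rates) is a placeholder invented for illustration, not a real vendor price, and the function names are hypothetical.

```python
def monthly_ingest_gb(requests_per_sec: float, bytes_per_request: float, days: int = 30) -> float:
    """Estimate telemetry volume in GB for a month of steady traffic."""
    seconds = days * 24 * 60 * 60
    return requests_per_sec * bytes_per_request * seconds / 1e9


def monthly_cost(model: str, *, hosts: int, users: int, ingest_gb: float, rate: float) -> float:
    """Monthly cost under one pricing model; `rate` is the price per billed unit."""
    units = {"per_host": hosts, "per_user": users, "per_gb": ingest_gb}
    return units[model] * rate


# All values below are placeholders, NOT real vendor prices.
ingest = monthly_ingest_gb(requests_per_sec=500, bytes_per_request=2_000)
footprint = {"hosts": 40, "users": 25, "ingest_gb": ingest}
hypothetical_rates = {"per_host": 20.0, "per_user": 100.0, "per_gb": 0.50}

costs = {m: monthly_cost(m, rate=r, **footprint) for m, r in hypothetical_rates.items()}
for model, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{model:9s} ${cost:,.2f}/month")
```

Even in this toy example, the cheapest model depends entirely on the shape of the deployment: doubling traffic changes only the per-GB figure, while adding hosts or users changes only the other two. That is why a TCO comparison needs your own footprint as input.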

Why Choose an Open Source Observability Platform?

Open source observability tools aim to provide a standard, vendor-agnostic approach for ingesting, transforming and sending data to an observability backend. They can be an alternative that saves on licensing costs and consolidates multiple APM tools into a toolset that fits your requirements and budget.

However, open source systems take effort to set up and maintain, which adds to your initial operational cost. In the long term, you save on licensing fees and avoid vendor lock-in and contractual agreements.

Gartner predicts that, by 2025, 70% of new cloud-native application monitoring will use open-source instrumentation rather than vendor-specific agents for improved interoperability, and 70% of new cloud-native applications will adopt OpenTelemetry for observability rather than vendor-specific agents and software development kits (SDKs).

Scale Observability Using the Open Source Ecosystem

The open source landscape for observability is quite dynamic. There are multiple Cloud Native Computing Foundation (CNCF) open source tools for observability and monitoring. This post will primarily focus on the OpenTelemetry framework and LGTM technology stack.

OpenTelemetry:
The “too many tools” challenge described above creates a further struggle in telemetry data collection. Each tool vendor has its own APIs, SDKs, agents and collectors for logs, metrics and traces. The OpenTelemetry framework addresses this by providing a unified way to create and manage telemetry data such as logs, traces and metrics.

The OpenTelemetry (OTel) project, under the auspices of the CNCF, offers a unified set of vendor-agnostic APIs, SDKs and tools for generating and collecting telemetry data and exporting it to various analysis tools. You get one API and SDK per programming language to extract your application’s observability data, a standard collector, the OpenTelemetry Protocol (OTLP) for transmission and more.
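
To give a feel for what travels over OTLP, the sketch below hand-builds a single span in the JSON encoding of the protocol, using only the Python standard library. It is an illustration of the payload shape, not a substitute for the OpenTelemetry SDK, which normally generates and exports this data for you. The function name, the service name `checkout-service`, the span name and the scope name are all assumptions made for this example.

```python
import json
import secrets
import time


def build_otlp_trace_payload(service_name: str, span_name: str, duration_ms: int) -> dict:
    """Build one span in the OTLP/JSON shape (resource -> scope -> spans)."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    return {
        "resourceSpans": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service_name}}
                ]
            },
            "scopeSpans": [{
                "scope": {"name": "manual-sketch"},  # hypothetical scope name
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 16 bytes -> 32 hex chars
                    "spanId": secrets.token_hex(8),    # 8 bytes -> 16 hex chars
                    "name": span_name,
                    "kind": 2,  # SPAN_KIND_SERVER
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }]
    }


# Hypothetical service and operation names; nothing is sent over the network.
payload = build_otlp_trace_payload("checkout-service", "GET /cart", duration_ms=42)
body = json.dumps(payload)
print(body)
```

Because every instrumented service emits this same vendor-neutral structure, the backend can be swapped without touching application code. A collector that accepts OTLP over HTTP conventionally receives such a body at the `/v1/traces` path.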

LGTM:
A popular open source approach to observability and monitoring is the LGTM technology stack, named after its components: Loki, Grafana, Tempo and Mimir.

In the LGTM stack, we leverage:
• Loki for log aggregation
• Grafana dashboards for telemetry visualizations
• Tempo (or Jaeger) for trace aggregation
• Mimir (or a managed Prometheus service) for metrics aggregation
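
Each backend in the stack consumes its signal in its own format. As a rough sketch, the snippet below renders one log line in the JSON shape accepted by Loki's push endpoint (`/loki/api/v1/push`) and one metric sample in the Prometheus text exposition format. The function names, label names and values are hypothetical, and nothing is sent over the network.

```python
import json
from typing import Dict


def loki_push_body(labels: Dict[str, str], line: str, ts_ns: int) -> str:
    """JSON body in the shape accepted by Loki's push endpoint."""
    return json.dumps(
        {"streams": [{"stream": labels, "values": [[str(ts_ns), line]]}]}
    )


def prometheus_sample(name: str, labels: Dict[str, str], value: float) -> str:
    """One sample line in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"


# Hypothetical labels and values for a single failing request.
log_body = loki_push_body(
    {"service": "checkout", "level": "error"},
    "payment gateway timeout",
    ts_ns=1_700_000_000_000_000_000,
)
metric_line = prometheus_sample(
    "http_requests_total", {"service": "checkout", "status": "500"}, 3
)
print(log_body)
print(metric_line)
```

In both formats the labels (here `service`) are what Grafana queries filter on, so keeping label names consistent across logs and metrics makes it much easier to move between signals in a dashboard.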

Conclusion

Observability is about complete visibility across your systems and tying business metrics to technical data. Monitoring is about understanding whether things are working correctly, and AIOps is about deriving meaning from that visibility. Observability and monitoring are critical to ensuring the smooth functioning of applications and meeting customer SLAs. By investing in the open source OpenTelemetry framework and LGTM tools, SRE teams can effectively monitor their applications and gain insights into system behavior and potential issues. These tools offer cost-effectiveness and customization to suit specific requirements. They also promote vendor neutrality, which can be critical for avoiding vendor lock-in.