Enhancing API Observability Series (Part 3): Tracing

Introduction

What Is Distributed Tracing?

In today's landscape of microservices architecture and distributed systems, a single request often traverses multiple services, each containing various internal processing steps. To ensure the efficient and stable operation of systems, it's imperative to comprehend the complete path and processing of these requests. This necessity gave rise to distributed tracing technology. It enables us to gain a clear understanding of the entire journey of a request from initiation to completion, including every service it passes through, each service's processing time, and the occurrence of any anomalies.
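
To make the idea concrete, here is a minimal sketch of the data a trace typically carries: a set of spans that share a trace ID and point to their parent span. The `Span` fields, the `render` helper, and the sample values are illustrative assumptions, not the data model of any particular tracing tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span of one request
    span_id: str              # unique within the trace
    parent_id: Optional[str]  # None marks the root span
    service: str
    operation: str
    start_ms: int             # offset from the start of the trace
    duration_ms: int
    error: bool = False


def render(spans: list[Span]) -> list[str]:
    """Return one indented line per span, parents before children."""
    children: dict[Optional[str], list[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        # Visit siblings in start order so the output follows the request path.
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            flag = " [error]" if span.error else ""
            lines.append(
                f"{'  ' * depth}{span.service}:{span.operation} {span.duration_ms}ms{flag}"
            )
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Made-up spans for a single request; all values are illustrative.
trace = [
    Span("t1", "a", None, "gateway", "POST /orders", 0, 120),
    Span("t1", "b", "a", "order", "create_order", 5, 110),
    Span("t1", "c", "b", "inventory", "reserve", 10, 90, error=True),
]
print("\n".join(render(trace)))
```

The indented output mirrors what a tracing UI shows as a waterfall: which service called which, how long each step took, and where an error occurred.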

Process of Utilizing Tracing in APIs

The first step is to deploy a tracing agent or SDK at the entry point of each service so that incoming requests are captured. Each service then passes the trace context along with its outgoing calls, which lets the spans recorded by different services be joined into a single trace. The resulting data shows how long a request spends in each service, making performance bottlenecks visible. Distributed tracing records exceptions and errors as well as normal request processing. Visualizing the tracing data then shows how requests flow between services and components.
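
For requests to be followed across services, each service has to forward a trace identifier with its outgoing calls. The sketch below assumes the W3C Trace Context `traceparent` header as the carrier, which is one common format (Zipkin and SkyWalking also define their own headers). The function names are hypothetical, and only the version 00 layout is handled.

```python
import re
import secrets

# version "00" - 32 hex trace ID - 16 hex parent span ID - 2 hex flags
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are not valid identifiers.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def outgoing_traceparent(incoming):
    """Build the header for a downstream call: keep the trace ID, mint a new span ID."""
    parsed = parse_traceparent(incoming) if incoming else None
    if parsed is None:
        # No usable context: start a new trace. Sampling it is an assumption of this sketch.
        trace_id, sampled = secrets.token_hex(16), True
    else:
        trace_id, _, sampled = parsed
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(outgoing_traceparent(incoming))
```

Because the trace ID is preserved and only the span ID changes, a backend can later stitch the spans reported by each service into one end-to-end trace.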

Enhancing Observability - Distributed Tracing

The following practices show how distributed tracing can be used to improve API observability:

1. Selecting Appropriate Distributed Tracing Tools and Technologies

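When choosing distributed tracing tools, consider your technology stack, business requirements, and monitoring complexity. Zipkin, SkyWalking, and OpenTelemetry are popular options with different emphases: Zipkin is a lightweight trace collector with its own UI, SkyWalking is a fuller application performance monitoring platform built around language agents, and OpenTelemetry is a vendor-neutral standard for generating and exporting telemetry that relies on a separate backend for storage and display.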

2. Integrating Distributed Tracing into API Development

Zipkin and SkyWalking can usually be integrated by adding the relevant client library or agent together with its configuration. OpenTelemetry offers automatic instrumentation for many languages and frameworks, and its API can also be used to create spans and manage tracing context manually where finer control is needed.

3. Configuring and Optimizing Distributed Tracing Systems

Zipkin, SkyWalking, and OpenTelemetry can all be tuned through configuration, for example the sampling rate, the backend storage, and how trace data is batched and transmitted. It is also important to define alert rules, either in the tracing system where it supports them or in the surrounding monitoring stack, so that exceptional events get a prompt response.
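
The sampling rate is usually the setting with the largest effect on cost. One common approach is to derive the decision from the trace ID, so that every service reaches the same verdict for the same request. The function below is a simplified sketch of that idea. It assumes hexadecimal trace IDs and does not reproduce the exact algorithm of any of the tools above.

```python
def should_sample(trace_id: str, ratio: float) -> bool:
    """Keep roughly `ratio` of all traces, deciding from the trace ID alone."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the last 16 hex digits as a 64-bit integer and compare it to a threshold.
    bucket = int(trace_id[-16:], 16)
    return bucket < int(ratio * 2**64)


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
# The same trace ID always yields the same decision for a given ratio.
print(should_sample(trace_id, 0.1), should_sample(trace_id, 1.0))
```

A deterministic decision avoids partial traces in which some services recorded a request and others dropped it.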

4. Data Analysis and Visualization

Zipkin and SkyWalking provide their own visualization interfaces for tracing data and performance metrics. For instance, in Zipkin's UI, specific traces can be searched and viewed to understand the flow of requests between different services. SkyWalking's dashboard offers a global performance overview and a service call relationship graph. OpenTelemetry does not ship a UI of its own; its data is exported to a backend and can be explored in visualization tools like Grafana to create custom dashboards and charts.
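
Beyond the built-in views, exported trace data can also be analyzed directly. As a rough sketch, the function below attributes time to each service by subtracting the duration of a span's direct children from its own duration. The span dictionaries and their field names are assumptions, the numbers are made up, and the calculation assumes child calls run one after another rather than in parallel.

```python
from collections import defaultdict


def self_time_by_service(spans):
    """Sum, per service, the time spent in each span excluding its direct children."""
    child_total = defaultdict(float)
    for span in spans:
        if span["parent_id"] is not None:
            child_total[span["parent_id"]] += span["duration_ms"]

    totals = defaultdict(float)
    for span in spans:
        # Clamp at 0 in case children overlap and add up to more than the parent.
        own = max(span["duration_ms"] - child_total[span["span_id"]], 0.0)
        totals[span["service"]] += own
    return dict(totals)


# Illustrative spans for one request.
spans = [
    {"span_id": "1", "parent_id": None, "service": "gateway", "duration_ms": 950},
    {"span_id": "2", "parent_id": "1", "service": "order", "duration_ms": 900},
    {"span_id": "3", "parent_id": "2", "service": "inventory", "duration_ms": 780},
    {"span_id": "4", "parent_id": "2", "service": "payment", "duration_ms": 60},
]
totals = self_time_by_service(spans)
slowest = max(totals, key=totals.get)
print(slowest, totals[slowest])  # prints: inventory 780.0
```

Looking at self time rather than total duration keeps an upstream service from being blamed for latency that originates in a service it calls.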

5. API7 Enterprise Integrates with Distributed Tracing Plugins

API7 Enterprise supports multiple tracing plugins, including Zipkin, OpenTracing, and SkyWalking. A tracing plugin takes effect only once it is bound to a route or to a global rule. Unless different routes need different sampling rates, binding the plugin to a global rule is advisable so that no route is left untraced.
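
As a rough illustration of what binding a tracing plugin to a global rule might look like, the sketch below builds a rule body using the attribute names of the open-source Apache APISIX `zipkin` plugin (`endpoint`, `sample_ratio`, `service_name`). Whether API7 Enterprise accepts exactly this shape is an assumption, and the collector URL and service name are placeholders.

```python
import json


def zipkin_global_rule(endpoint: str, sample_ratio: float, service_name: str) -> dict:
    """Build a global-rule body that enables the zipkin plugin for every route."""
    # Apache APISIX documents sample_ratio as a value between 0.00001 and 1.
    if not 0.00001 <= sample_ratio <= 1:
        raise ValueError("sample_ratio must be between 0.00001 and 1")
    return {
        "plugins": {
            "zipkin": {
                "endpoint": endpoint,          # where spans are reported
                "sample_ratio": sample_ratio,  # 1 traces every request
                "service_name": service_name,  # name shown for the gateway in the UI
            }
        }
    }


# Placeholder collector URL and service name.
body = zipkin_global_rule("http://zipkin.example.internal:9411/api/v2/spans", 1, "api-gateway")
print(json.dumps(body, indent=2))
```

The same plugin block attached to a single route instead of a global rule would allow a route-specific `sample_ratio`.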

Practical Case Analysis: Improving Observability of E-commerce APIs

During the process of browsing and purchasing products on an e-commerce platform, multiple API calls are involved. For instance, users initially call the product service's API to retrieve a list of products, then select a specific product and call the order service's API to create an order, and finally call the payment service's API to complete the payment.

In this scenario, it was noticed that the order service's API often experienced delays and timeouts during peak periods, resulting in noticeable delays and failures during the checkout process. To address this issue, the team decided to introduce distributed tracing technology to diagnose performance bottlenecks and optimize the system.

Selecting Distributed Tracing Tools: The team chose SkyWalking as the distributed tracing tool due to its support for multiple languages, ease of integration, and rich visualization capabilities.
Integrating SkyWalking: The order service is developed in Java, so the team attached SkyWalking's Java agent to the order service's JVM at startup (through the -javaagent option) rather than changing the service's code. This allows SkyWalking to automatically collect tracing data when the order service's API is called.
Configuring SkyWalking: The team configured SkyWalking to use Elasticsearch as its backend storage and set a sampling rate that balances the level of detail in the tracing data against storage costs.
Collecting and Analyzing Tracing Data: During peak periods, the team observed the call chain and performance metrics of the order service's API through SkyWalking's UI. They found that a particular call to the product inventory service's API took significantly longer during the order creation process, becoming a performance bottleneck.
In-depth Investigation: The team further examined detailed tracing data of the product inventory service's API, including call parameters, return results, and exception information. They discovered that the API executed a complex database query operation when processing specific products, leading to increased processing time.
Optimization Measures: To address this issue, the team implemented two optimization measures. Firstly, they optimized the database query statements to improve query efficiency. Secondly, they implemented caching for the product inventory service's API, retrieving results directly from the cache for frequently queried and infrequently changing products, thereby avoiding unnecessary database queries.
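
The caching measure can be sketched as a small time-to-live cache placed in front of the inventory lookup. The class, the `query_stock_from_db` function, and the 30-second TTL are hypothetical and chosen only for illustration. The article does not describe the team's actual cache, and this version is not thread-safe.

```python
import time


class TTLCache:
    """Cache loader results per key for a fixed number of seconds."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh entry: skip the expensive lookup
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop a key, for example after the stock level changes."""
        self._entries.pop(key, None)


def query_stock_from_db(product_id):
    # Stand-in for the slow database query found through tracing.
    time.sleep(0.05)
    return {"product_id": product_id, "in_stock": 12}


stock_cache = TTLCache(query_stock_from_db, ttl_seconds=30)
stock_cache.get("sku-1")         # first call runs the query
print(stock_cache.get("sku-1"))  # second call is served from the cache
```

A short TTL bounds how stale a stock figure can get, and calling `invalidate` when stock changes tightens that further for products whose inventory does move.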

Conclusion

Distributed tracing plays a crucial role in microservices architectures and distributed systems. By recording and visualizing the flow of requests across multiple services, we can quickly identify and address performance bottlenecks, improving the stability and observability of the system. Selecting appropriate distributed tracing tools and integrating them into API development gives deeper insight into how the system operates, which in turn improves user experience and system efficiency.