Atys: An Efficient Profiling Framework for Identifying Hotspot Functions in Large-scale Cloud Microservices

📅 2025-06-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Identifying performance-critical (hot) functions in large-scale cloud-native microservice systems written in heterogeneous programming languages is difficult because of fragmented tracing, high sampling overhead, and inefficient stack aggregation. This paper proposes Atys, a low-overhead, cross-language profiling framework for hot-function identification. It combines four components: a language-agnostic adaptation mechanism, a two-level flame graph aggregation scheme, Function Selective Pruning (FSP), and a state-aware Frequency Dynamic Adjustment (FDA) scheme. In cluster-scale experiments on two benchmarks, FSP reduces stack trace aggregation time by 6.8% with a mean absolute percentage error (MAPE) of only 0.58%, and FDA cuts profiling cost by 87.6% while keeping mean squared error (MSE) on par with high-rate sampling. Together these lower the cost of hotspot localization in large-scale distributed services while maintaining accuracy.

📝 Abstract

To handle the high volume of requests, large-scale services comprise thousands of instances deployed in clouds. These services use diverse programming languages and are distributed across various nodes as encapsulated containers. Given their vast scale, even minor performance enhancements can lead to significant cost reductions. In this paper, we introduce Atys, an efficient profiling framework specifically designed to identify hotspot functions within large-scale distributed services. Atys presents four key features. First, it implements a language-agnostic adaptation mechanism for multilingual microservices. Second, a two-level aggregation method is introduced to provide a comprehensive overview of flamegraphs. Third, we propose a function selective pruning (FSP) strategy to enhance the efficiency of aggregating profiling results. Finally, we develop a frequency dynamic adjustment (FDA) scheme that dynamically modifies sampling frequency based on service status, effectively minimizing profiling cost while maintaining accuracy. Cluster-scale experiments on two benchmarks show that the FSP strategy achieves a 6.8% reduction in time with a mere 0.58% mean absolute percentage error (MAPE) in stack trace aggregation. Additionally, the FDA scheme ensures that the mean squared error (MSE) remains on par with that at high sampling rates, while achieving an 87.6% reduction in cost.
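To make the aggregation and MAPE figures above more concrete, here is a minimal Python sketch of how per-instance stack samples might be merged and how an approximate result could be scored against a full one. It is not the paper's implementation: the folded-stack input format (`"main;a;b"`), the function names `aggregate`, `function_shares`, and `mape`, and the sample data are all assumptions made for illustration, and the pruning step itself is not modeled because the abstract does not describe it.

```python
from collections import Counter

def aggregate(per_instance_stacks):
    """Sum sample counts per folded stack ('main;a;b') across instances (assumed format)."""
    total = Counter()
    for stacks in per_instance_stacks:
        total.update(stacks)  # Counter.update adds counts rather than replacing them
    return total

def function_shares(agg):
    """Fraction of all samples whose stack contains each function (inclusive share)."""
    n = sum(agg.values())
    hits = Counter()
    for stack, count in agg.items():
        for fn in set(stack.split(";")):  # set() avoids double counting recursive frames
            hits[fn] += count
    return {fn: c / n for fn, c in hits.items()}

def mape(reference, estimate):
    """Mean absolute percentage error, in percent, over the nonzero reference keys."""
    errors = [
        abs(ref - estimate.get(key, 0.0)) / ref
        for key, ref in reference.items()
        if ref != 0
    ]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical samples from two service instances.
instances = [
    {"main;a;b": 6, "main;c": 4},
    {"main;a;b": 4, "main;a": 6},
]
shares = function_shares(aggregate(instances))
```

With these made-up samples, `shares["a"]` is 0.8 because 16 of the 20 samples pass through `a`. Comparing a full aggregation with a cheaper approximate one through `mape` is one plausible way to arrive at a percentage such as the reported 0.58%.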

Problem

Research questions and friction points this paper is trying to address.

Identify hotspot functions in large-scale cloud microservices

Enable efficient profiling for multilingual distributed services

Reduce profiling cost while maintaining accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-agnostic adaptation for multilingual microservices

Two-level aggregation for comprehensive flamegraphs

Dynamic sampling frequency adjustment for cost efficiency
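The idea of adjusting sampling frequency to service state can be sketched as below. This is an illustrative rule only, not the FDA scheme from the paper: the use of CPU utilization as the state signal, the 99 Hz and 1 Hz bounds, the 0.10 change threshold, the halving decay, and the names `next_frequency` and `schedule` are all assumptions.

```python
def next_frequency(current_hz, util, prev_util, *,
                   low_hz=1, high_hz=99, change_threshold=0.10):
    """Return the next sampling frequency (hypothetical rule).

    Jump to high_hz when utilization shifts by at least change_threshold;
    otherwise halve the frequency, never going below low_hz.
    """
    if abs(util - prev_util) >= change_threshold:
        return high_hz
    return max(low_hz, current_hz // 2)

def schedule(utils, *, low_hz=1, high_hz=99, change_threshold=0.10):
    """Frequencies chosen over a utilization trace, starting from high_hz."""
    freq = high_hz
    prev = utils[0]
    chosen = []
    for util in utils:
        freq = next_frequency(freq, util, prev, low_hz=low_hz,
                              high_hz=high_hz, change_threshold=change_threshold)
        chosen.append(freq)
        prev = util
    return chosen

# Hypothetical trace: steady load, then a jump in utilization.
trace = [0.5, 0.5, 0.5, 0.9, 0.9]
frequencies = schedule(trace)  # [49, 24, 12, 99, 49]
```

In this sketch the sampler backs off while the service is steady and returns to the high rate when the load shifts, so the total number of samples taken is lower than with a fixed high frequency. Whether accuracy is preserved depends on the real state signal and thresholds, which this summary does not specify.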

🔎 Similar Papers

Failure Diagnosis in Microservice Systems: A Comprehensive Survey and Analysis