🤖 AI Summary
In large-scale cloud-native microservice systems, identifying performance-critical (hot) functions across heterogeneous programming languages remains challenging because of fragmented tracing, high sampling overhead, and inefficient stack aggregation. To address these issues, this paper proposes Atys, a low-overhead, cross-language profiling framework for hot-function identification. It combines a language-agnostic instrumentation adaptation mechanism, a two-level flame graph stack aggregation scheme, Function Selective Pruning (FSP), and state-aware Frequency Dynamic Adjustment (FDA). In cluster-scale experiments on two benchmarks, FSP reduces stack trace aggregation time by 6.8% with a mean absolute percentage error (MAPE) of only 0.58%, and FDA cuts profiling cost by 87.6% while keeping mean squared error (MSE) on par with that of high-rate sampling. Together, these techniques lower the cost of hot-spot localization in production-scale distributed environments while preserving accuracy.
📝 Abstract
To handle a high volume of requests, large-scale services are composed of thousands of instances deployed in the cloud. These services are written in diverse programming languages and are distributed across many nodes as encapsulated containers. At this scale, even minor performance improvements can lead to significant cost reductions. In this paper, we introduce Atys, an efficient profiling framework designed to identify hotspot functions in large-scale distributed services. Atys has four key features. First, it implements a language-agnostic adaptation mechanism for multilingual microservices. Second, it introduces a two-level aggregation method that provides a comprehensive overview of flame graphs. Third, we propose a function selective pruning (FSP) strategy to make the aggregation of profiling results more efficient. Finally, we develop a frequency dynamic adjustment (FDA) scheme that adjusts the sampling frequency according to service status, minimizing profiling cost while maintaining accuracy. Cluster-scale experiments on two benchmarks show that the FSP strategy reduces stack trace aggregation time by 6.8% with a mean absolute percentage error (MAPE) of only 0.58%. Additionally, the FDA scheme keeps the mean squared error (MSE) on par with that at high sampling rates while reducing cost by 87.6%.
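To make the trade-off between pruning and aggregation accuracy concrete, the following is a minimal Python sketch. It is not the Atys implementation: the abstract does not specify how FSP selects functions. The sketch assumes a simple rule in which each stack is truncated at the first function whose inclusive sample share falls below a threshold. The names `prune`, `min_share`, and the sample data are all hypothetical. It then measures the MAPE of the per-function counts for the hot functions after pruning.

```python
from collections import Counter

def aggregate(samples):
    """Merge sampled stacks (tuples of frames, root first) into per-stack counts."""
    return Counter(samples)

def function_totals(stack_counts):
    """Inclusive sample count per function; a function counts once per stack."""
    totals = Counter()
    for stack, count in stack_counts.items():
        for fn in set(stack):
            totals[fn] += count
    return totals

def prune(stack_counts, min_share):
    """Hypothetical pruning rule (an assumption, not the paper's FSP):
    cut each stack at the first frame whose inclusive share is below min_share."""
    total = sum(stack_counts.values())
    totals = function_totals(stack_counts)
    pruned = Counter()
    for stack, count in stack_counts.items():
        kept = []
        for fn in stack:
            if totals[fn] / total < min_share:
                break
            kept.append(fn)
        pruned[tuple(kept)] += count
    return pruned

def mape(reference, estimate):
    """Mean absolute percentage error over the keys of reference, in percent."""
    errors = [abs(reference[k] - estimate.get(k, 0)) / reference[k] for k in reference]
    return 100.0 * sum(errors) / len(errors)

# Made-up samples: 100 stacks collected from one service.
samples = (
    [("main", "handle", "parse")] * 50
    + [("main", "handle", "encode")] * 44
    + [("main", "handle", "log")] * 3
    + [("main", "audit", "parse")] * 2
    + [("main", "gc")] * 1
)

MIN_SHARE = 0.05  # assumed threshold for illustration
full = aggregate(samples)
pruned = prune(full, MIN_SHARE)

# Compare per-function counts for the functions that are hot in the full profile.
full_totals = function_totals(full)
hot = {fn: n for fn, n in full_totals.items() if n / len(samples) >= MIN_SHARE}
error = mape(hot, function_totals(pruned))

print(f"distinct stacks: {len(full)} -> {len(pruned)}")
print(f"MAPE over hot functions: {error:.2f}%")
```

In this toy data the pruned profile has fewer distinct stacks to merge, and the only error comes from `parse` samples that sat beneath the rarely called `audit` frame. That is the kind of small, bounded inaccuracy a pruning strategy accepts in exchange for less aggregation work.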