Atys: An Efficient Profiling Framework for Identifying Hotspot Functions in Large-scale Cloud Microservices

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
For large-scale cloud-native microservice systems, identifying performance-critical (hot) functions across heterogeneous programming languages remains challenging due to fragmented tracing, high sampling overhead, and inefficient stack aggregation. To address these issues, this paper proposes an efficient, low-overhead cross-language hot-function identification method. We design a language-agnostic instrumentation adaptation mechanism, a two-level flame graph stack aggregation scheme, Function Selective Pruning (FSP), and State-aware Frequency Dynamic Adjustment (FDA). FSP reduces aggregation latency by 6.8% while maintaining a mean absolute percentage error of only 0.58%. FDA cuts sampling overhead by 87.6% without degrading mean squared error. The proposed approach significantly improves hot-spot localization accuracy and enhances observability efficiency in production-scale distributed environments.

📝 Abstract
To handle a high volume of requests, large-scale services comprise thousands of instances deployed in clouds. These services use diverse programming languages and are distributed across many nodes as encapsulated containers. Given their vast scale, even minor performance improvements can lead to significant cost reductions. In this paper, we introduce Atys, an efficient profiling framework specifically designed to identify hotspot functions within large-scale distributed services. Atys has four key features. First, it implements a language-agnostic adaptation mechanism for multilingual microservices. Second, it introduces a two-level aggregation method that provides a comprehensive overview of flamegraphs. Third, we propose a function selective pruning (FSP) strategy to make the aggregation of profiling results more efficient. Finally, we develop a frequency dynamic adjustment (FDA) scheme that dynamically modifies the sampling frequency based on service status, minimizing profiling cost while maintaining accuracy. Cluster-scale experiments on two benchmarks show that the FSP strategy achieves a 6.8% reduction in time with a mean absolute percentage error (MAPE) of only 0.58% in stack trace aggregation. Additionally, the FDA scheme keeps the mean squared error (MSE) on par with that at high sampling rates while achieving an 87.6% reduction in cost.
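The two error metrics quoted in the abstract can be made concrete with a small sketch. The abstract does not say which quantities the errors are computed over, so this example assumes they compare per-function sample counts from a reference profile against an approximated one; the function names and the sample data are hypothetical.

```python
def mape(reference, estimate):
    """Mean absolute percentage error (in percent) over functions
    whose reference count is nonzero."""
    keys = [name for name, count in reference.items() if count != 0]
    total = sum(abs(reference[k] - estimate.get(k, 0)) / abs(reference[k]) for k in keys)
    return 100.0 * total / len(keys)


def mse(reference, estimate):
    """Mean squared error over every function seen in either profile."""
    keys = reference.keys() | estimate.keys()
    total = sum((reference.get(k, 0) - estimate.get(k, 0)) ** 2 for k in keys)
    return total / len(keys)


# Hypothetical sample counts per function: exact aggregation vs. an approximation.
full_profile = {"parse": 200, "encode": 100, "send": 50}
approx_profile = {"parse": 198, "encode": 101, "send": 50}

mape_percent = mape(full_profile, approx_profile)  # (1% + 1% + 0%) / 3
mse_value = mse(full_profile, approx_profile)      # (4 + 1 + 0) / 3
```

A low MAPE means each function's share is nearly unchanged in relative terms, while MSE penalizes large absolute deviations more heavily.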
Problem

Research questions and friction points this paper is trying to address.

Identify hotspot functions in large-scale cloud microservices
Enable efficient profiling for multilingual distributed services
Reduce profiling cost while maintaining accuracy
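One way to picture the cost/accuracy trade-off is a controller that samples densely only while the service state is changing. The sketch below is an illustrative assumption, not the FDA scheme from the paper: the rule, the thresholds, the frequency bounds, and the name `next_frequency` are all hypothetical.

```python
def next_frequency(current_hz, load_samples, low_hz=1, high_hz=99, change_threshold=0.2):
    """Pick the next sampling frequency from recent load readings.

    Hypothetical rule: if the load changed by more than `change_threshold`
    (relative) between the two most recent readings, sample at `high_hz`;
    otherwise halve the current frequency, but never go below `low_hz`.
    """
    if len(load_samples) < 2:
        # Not enough history to judge stability: sample at the high rate.
        return high_hz
    previous, latest = load_samples[-2], load_samples[-1]
    baseline = max(abs(previous), 1e-9)  # avoid division by zero
    relative_change = abs(latest - previous) / baseline
    if relative_change > change_threshold:
        return high_hz
    return max(low_hz, current_hz // 2)


# Stable load: the frequency decays step by step toward low_hz.
stable_step = next_frequency(99, [0.50, 0.51])
# Sudden load change: the frequency jumps back to high_hz.
burst_step = next_frequency(49, [0.50, 0.90])
```

Under this rule the profiler spends most of its samples on periods where the hotspot distribution is likely to shift, which is the intuition behind lowering cost without losing accuracy.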
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-agnostic adaptation for multilingual microservices
Two-level aggregation for comprehensive flamegraphs
Dynamic sampling frequency adjustment for cost efficiency
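Flamegraph aggregation can be sketched with the widely used folded-stack representation, where each stack is a `;`-joined string of frames mapped to a sample count. The abstract does not define the two levels, so this example assumes the first level merges the instances of one service and the second merges services into a cluster-wide view; all names and sample data are hypothetical, and pruning is not modeled.

```python
from collections import Counter


def merge_profiles(profiles):
    """Sum the sample counts of identical stacks across several profiles."""
    merged = Counter()
    for profile in profiles:
        merged.update(profile)  # Counter.update adds counts for mappings
    return merged


def aggregate_two_level(instances_by_service):
    """Level 1: merge the instances of each service.
    Level 2: merge services, prefixing each stack with the service name
    so that frames remain attributable in the cluster-wide view."""
    per_service = {
        service: merge_profiles(instances)
        for service, instances in instances_by_service.items()
    }
    cluster = Counter()
    for service, profile in per_service.items():
        for stack, count in profile.items():
            cluster[f"{service};{stack}"] += count
    return per_service, cluster


def hottest_functions(profile, top=3):
    """Rank leaf frames by the number of samples in which they were on-CPU."""
    self_samples = Counter()
    for stack, count in profile.items():
        self_samples[stack.rsplit(";", 1)[-1]] += count
    return self_samples.most_common(top)


# Hypothetical folded stacks collected from three container instances.
instances_by_service = {
    "cart": [
        {"main;handle;json_encode": 30, "main;handle;db_query": 10},
        {"main;handle;json_encode": 20},
    ],
    "auth": [{"main;verify;sha256": 15}],
}
per_service, cluster = aggregate_two_level(instances_by_service)
top_two = hottest_functions(cluster, top=2)
```

Keeping the per-service result alongside the cluster view lets a reader drill down from a cluster-wide hotspot to the service that contributes it.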
Jiaqi Sun
Carnegie Mellon University
Causality, graph representation learning
Dingyu Yang
Zhejiang University
Database, Performance Evaluation, Distributed Processing
Shiyou Qian
Shanghai Jiao Tong University
Computer Science
Jian Cao
Department of Computer Science, Shanghai Jiao Tong University, 800 Dongchuan RD, Shanghai, 200240, Shanghai, China; Shanghai Key Laboratory of Trusted Data Circulation and Governance in Web3, 800 Dongchuan RD, Shanghai, 200240, Shanghai, China
Guangtao Xue
Professor of Computer Science, Shanghai Jiao Tong University
Mobile Computing, Social Networks, Wireless Sensor Networks, Distributed Computing