When Should I Run My Application Benchmark?: Studying Cloud Performance Variability for the Case of Stream Processing Applications

📅 2025-04-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study examines the variability of performance benchmarking results for stream processing applications in cloud environments. In a longitudinal study spanning more than three months, the authors repeatedly executed an application benchmark on AWS across multiple geographic regions and machine types with different CPU architectures, using Kubernetes-based automated deployment, repeated benchmark executions, and time-series statistical analysis to characterize cloud performance variability from an end-to-end, application-level perspective. They find that variability is less pronounced than often assumed (coefficient of variation < 3.7%), that performance follows daily and weekly patterns with only small amplitude (≤ 2.5%), and that re-using benchmarking infrastructure across repetitions reduces result accuracy by at most 2.5 percentage points. These observations hold consistently across cloud regions and processor architectures, providing empirical evidence and methodological guidance for more reproducible and trustworthy cloud-based benchmarking.

📝 Abstract
Performance benchmarking is a common practice in software engineering, particularly when building large-scale, distributed, and data-intensive systems. While cloud environments offer several advantages for running benchmarks, it is often reported that benchmark results can vary significantly between repetitions -- making it difficult to draw reliable conclusions about real-world performance. In this paper, we empirically quantify the impact of cloud performance variability on benchmarking results, focusing on stream processing applications as a representative type of data-intensive, performance-critical system. In a longitudinal study spanning more than three months, we repeatedly executed an application benchmark used in research and development at Dynatrace. This allows us to assess various aspects of performance variability, particularly concerning temporal effects. With approximately 591 hours of experiments, deploying 789 Kubernetes clusters on AWS and executing 2366 benchmarks, this is likely the largest study of its kind and the only one addressing performance from an end-to-end, i.e., application benchmark perspective. Our study confirms that performance variability exists, but it is less pronounced than often assumed (coefficient of variation of < 3.7%). Unlike related studies, we find that performance does exhibit a daily and weekly pattern, although with only small variability (<= 2.5%). Re-using benchmarking infrastructure across multiple repetitions introduces only a slight reduction in result accuracy (<= 2.5 percentage points). These key observations hold consistently across different cloud regions and machine types with different processor architectures. We conclude that for engineers and researchers focused on detecting substantial performance differences (e.g., > 5%) in...
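As a rough illustration of the summary statistics the abstract reports (coefficient of variation across repetitions, daily/weekly patterns), here is a minimal Python sketch; the timestamps and throughput values are made up for illustration and are not data from the paper:

```python
import statistics
from collections import defaultdict
from datetime import datetime

# Hypothetical repeated benchmark results: (timestamp, achieved throughput).
# In the study, such samples come from thousands of benchmark executions
# spread over several months; here we use a few made-up values.
samples = [
    (datetime(2024, 11, 4, 9, 0),  102_300.0),
    (datetime(2024, 11, 4, 21, 0), 101_100.0),
    (datetime(2024, 11, 5, 9, 0),  103_900.0),
    (datetime(2024, 11, 5, 21, 0), 100_800.0),
]

values = [v for _, v in samples]

# Coefficient of variation: standard deviation relative to the mean.
# The paper reports a CV below 3.7% across repetitions.
cv = statistics.stdev(values) / statistics.mean(values)
print(f"coefficient of variation: {cv:.2%}")

# Simple hour-of-day breakdown to look for daily patterns:
# group samples by hour and compare group means against the overall mean.
by_hour = defaultdict(list)
for ts, v in samples:
    by_hour[ts.hour].append(v)

overall_mean = statistics.mean(values)
for hour, vs in sorted(by_hour.items()):
    deviation = statistics.mean(vs) / overall_mean - 1
    print(f"hour {hour:02d}: mean deviation from overall {deviation:+.2%}")
```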
Problem

Research questions and friction points this paper is trying to address.

Quantify the impact of cloud performance variability on benchmark results
Assess temporal effects on stream processing application benchmarks
Evaluate benchmark result accuracy across cloud regions and machine types
Innovation

Methods, ideas, or system contributions that make the work stand out.

Longitudinal study of cloud performance variability
Large-scale deployment of Kubernetes clusters on AWS (see the sketch after this list)
Application benchmark with an end-to-end perspective
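A hedged sketch of how repeated benchmark runs on fresh Kubernetes clusters could be orchestrated on AWS. The cluster tooling (eksctl), region, node type, manifest path, and job name are illustrative assumptions, not the paper's actual setup:

```python
import subprocess

# Assumed parameters for illustration; the paper's actual regions,
# machine types, and benchmark deployment are not reproduced here.
REGION = "us-east-1"
NODE_TYPE = "m6i.xlarge"
BENCHMARK_MANIFEST = "benchmark.yaml"  # hypothetical benchmark manifest

def run(cmd):
    """Run a shell command and fail loudly if it does not succeed."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def run_benchmark_on_fresh_cluster(run_id: int) -> None:
    """Create a cluster, execute one benchmark run, then tear the cluster down."""
    cluster = f"bench-{run_id}"
    run(["eksctl", "create", "cluster", "--name", cluster,
         "--region", REGION, "--node-type", NODE_TYPE, "--nodes", "3"])
    try:
        # Deploy the benchmark (e.g., a stream processing workload plus load generator).
        run(["kubectl", "apply", "-f", BENCHMARK_MANIFEST])
        # Wait for the benchmark job to finish before collecting results.
        run(["kubectl", "wait", "--for=condition=complete",
             "job/benchmark", "--timeout=2h"])
    finally:
        run(["eksctl", "delete", "cluster", "--name", cluster, "--region", REGION])

if __name__ == "__main__":
    for i in range(3):  # in the study, such runs were repeated over several months
        run_benchmark_on_fresh_cluster(i)
```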
Sören Henning
Dynatrace Research, Linz, Austria
Adriano Vogel
Dynatrace Research, Linz, Austria
Esteban Pérez-Wohlfeil
Dynatrace Research, Linz, Austria
Otmar Ertl
Dynatrace Research, Linz, Austria
Rick Rabiser
Professor at LIT CPS Lab, Johannes Kepler University Linz
Software Engineering