Long-term Monitoring of Kernel and Hardware Events to Understand Latency Variance

📅 2026-01-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem that application-level latency jitter often stems from kernel- and hardware-level events that are hard to observe from the application. The authors propose VarMRI, a toolchain that combines a selective event-logging mechanism designed for long-term monitoring with a hierarchical collection strategy that balances data volume against completeness and interpretability. By integrating efficient log analysis with system-wide performance monitoring, VarMRI identifies sources of latency variance, including interrupt preemption, Java garbage collection, and pipeline stalls, across more than 3,000 hours of experiments. Targeted optimizations guided by these findings reduce tail latency by up to 31%.

📝 Abstract

This paper presents our experience in understanding latency variance caused by kernel and hardware events, which are often invisible at the application level. For this purpose, we have built VarMRI, a tool chain that monitors and analyzes those events over the long term. To mitigate the "big data" problem caused by long-term monitoring, VarMRI selectively records a subset of events following two principles: it records only events that affect the requests recorded by the application, and it records coarse-grained information first, collecting additional information only when necessary. Furthermore, VarMRI introduces an analysis method that is efficient on large amounts of data, robust across different data sets and against missing data, and informative to the user. VarMRI has helped us carry out a 3,000-hour study of six applications and benchmarks on CloudLab. The study reveals a wide variety of events that cause latency variance, including interrupt preemption, Java GC, pipeline stalls, and NUMA balancing; simple optimization or tuning can reduce tail latencies by up to 31%. Moreover, the impact of some of these events varies significantly across experiments, which confirms the necessity of long-term monitoring.
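
The two recording principles in the abstract can be illustrated with a small sketch. This is not VarMRI's implementation: the `Request` and `Event` records, the overlap test, and the `threshold_ns` cutoff are all hypothetical names and choices made for illustration only. The sketch keeps an event only if it overlaps a request the application recorded, summarizes the kept events coarsely (total time per event kind), and flags only the kinds whose total crosses a threshold as candidates for finer-grained collection.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:  # hypothetical application-level request record
    req_id: int
    start_ns: int
    end_ns: int


@dataclass
class Event:  # hypothetical kernel/hardware event record
    kind: str
    start_ns: int
    end_ns: int


def overlaps(ev: Event, req: Request) -> bool:
    """True if the event's time interval intersects the request's interval."""
    return ev.start_ns < req.end_ns and req.start_ns < ev.end_ns


def select_events(events: List[Event], requests: List[Request]) -> List[Event]:
    """Principle 1 (assumed form): keep only events that overlap a recorded request."""
    return [ev for ev in events if any(overlaps(ev, r) for r in requests)]


def coarse_summary(events: List[Event]) -> Dict[str, int]:
    """Principle 2, first stage: coarse-grained total duration per event kind."""
    totals: Dict[str, int] = {}
    for ev in events:
        totals[ev.kind] = totals.get(ev.kind, 0) + (ev.end_ns - ev.start_ns)
    return totals


def kinds_needing_detail(totals: Dict[str, int], threshold_ns: int) -> List[str]:
    """Principle 2, second stage: request more detail only for costly event kinds."""
    return sorted(kind for kind, total in totals.items() if total >= threshold_ns)


requests = [Request(req_id=1, start_ns=100, end_ns=200)]
events = [
    Event("irq", 150, 160),           # overlaps request 1, kept
    Event("irq", 300, 310),           # outside every request, dropped
    Event("numa_balance", 90, 120),   # partially overlaps request 1, kept
]
selected = select_events(events, requests)
totals = coarse_summary(selected)
detail = kinds_needing_detail(totals, threshold_ns=20)
```

A linear scan over all requests per event is used here for brevity; a long-term monitor would need something cheaper, such as sorted intervals, and that choice is left open in this sketch.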

Problem

Research questions and friction points this paper is trying to address.

latency variance

kernel events

hardware events

long-term monitoring

tail latency

Innovation

Methods, ideas, or system contributions that make the work stand out.

long-term monitoring

latency variance

kernel events