CARM Tool: Cache-Aware Roofline Model Automatic Benchmarking and Application Analysis

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing tools lack automated, cache-aware Roofline modeling support across multiple CPU architectures, which hinders optimization guidance for high-performance computing applications. This work presents the CARM Tool, a unified and automated framework for constructing the Cache-aware Roofline Model (CARM) on x86 (Intel, AMD), ARM, and RISC-V CPUs. The tool uses assembly-level microbenchmarks to characterize computational throughput and the bandwidth of every memory hierarchy level, and it integrates hardware performance counters with dynamic binary instrumentation for application analysis. The microbenchmarks cover scalar code and all supported vector ISA extensions, and the constructed roofs deviate by less than 1% from the tested architectural maximums. The tool thereby fills the gap for platforms such as AMD and RISC-V, for which no CARM tooling previously existed.

📝 Abstract

In recent years, HPC systems, and the CPU architectures at their core, have become increasingly complex, making application development and optimization challenging. Intuitive performance models such as the Cache-aware Roofline Model (CARM) offer effective guidance by exposing the bottlenecks that keep an application from reaching the system's maximum performance. To fully exploit CARM guidance during application development, automatic tools for cross-architecture model construction and in-depth application characterization are essential. Across the many existing CPU architectures, current CARM-enabled tools are either vendor-specific (Intel Advisor), insufficiently developed (ARM), or nonexistent (AMD, RISC-V). This work closes that gap by bringing automatic CARM support to all major CPU architectures and ISAs, i.e., x86 (Intel, AMD), ARM, and RISC-V. It does so with assembly microbenchmarks tailored to cover the full performance spectrum of modern CPUs (from scalar to all supported vector ISA extensions) for both the computational units and all memory hierarchy levels. Additionally, this work integrates application analysis within the CARM framework using performance counters and dynamic binary instrumentation. Experimental results show that the CARM roofs constructed with the proposed automated framework deviate by less than 1% from the architectural maximums across the tested platforms.
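To make the notion of a CARM "roof" concrete, the sketch below evaluates the standard roofline bound, attainable performance = min(peak compute, arithmetic intensity × bandwidth), once per memory level. It is an illustration only and not code from the tool: the function names and all peak and bandwidth figures are hypothetical placeholders, not measurements from the paper.

```python
def carm_roofs(ai, peak_gflops, bandwidths_gbs):
    """Attainable GFLOP/s per memory level at arithmetic intensity `ai` (FLOP/byte).

    Each level's roof is the smaller of the compute peak and the
    bandwidth-limited rate ai * bandwidth.
    """
    return {level: min(peak_gflops, ai * bw) for level, bw in bandwidths_gbs.items()}


def deviation_percent(measured, theoretical):
    """Relative deviation of a measured roof from its theoretical maximum, in percent."""
    return 100.0 * abs(measured - theoretical) / theoretical


# Hypothetical machine: values chosen for illustration, not taken from the paper.
PEAK_GFLOPS = 100.0
BANDWIDTHS_GBS = {"L1": 400.0, "L2": 200.0, "L3": 100.0, "DRAM": 25.0}

if __name__ == "__main__":
    for level, perf in carm_roofs(0.5, PEAK_GFLOPS, BANDWIDTHS_GBS).items():
        print(f"{level}: {perf} GFLOP/s")
    print(f"deviation: {deviation_percent(99.5, 100.0)} %")
```

At an arithmetic intensity of 0.5 FLOP/byte, this hypothetical machine is compute-bound when data is served from L1 or L2 and bandwidth-bound when served from L3 or DRAM, which is the kind of distinction a cache-aware model is meant to expose.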

Problem

Research questions and friction points this paper is trying to address.

Cache-aware Roofline Model

automatic benchmarking

cross-architecture

performance modeling

HPC optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cache-aware Roofline Model

automatic benchmarking

cross-architecture optimization