Can Asymmetric Tile Buffering Be Beneficial?

📅 2025-11-19

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Conventional symmetric tile buffering requires the buffered tile of input matrix A to match the output tile of C along the M dimension, which limits data reuse and arithmetic intensity for GEMM. This work proposes asymmetric tile buffering (ATB), which decouples the buffered tile dimensions of the input and output operands, and shows for the first time that doing so is both practical and beneficial. It develops a performance model that weighs ATB's higher arithmetic intensity against its higher kernel-switching cost, giving guidance on how to select tiling factors. Applied to mixed-precision BFP16-BF16 GEMM on the AMD XDNA2 AI Engine, ATB achieves up to a 4.54× speedup, from 4.8 to 24.6 TFLOPS, a new performance record for the XDNA2 AIE.


📝 Abstract

General matrix multiplication (GEMM) is the computational backbone of modern AI workloads, and its efficiency is critically dependent on effective tiling strategies. Conventional approaches employ symmetric tile buffering, where the buffered tile size of the input $A$ along the dimension $M$ matches the output tile size of $C$. In this paper, we introduce asymmetric tile buffering (ATB), a simple but powerful technique that decouples the buffered tile dimensions of the input and output operands. We show, for the first time, that ATB is both practical and highly beneficial. To explain this effect, we develop a performance model that incorporates both the benefits of ATB (higher arithmetic intensity) and its overheads (higher kernel switching costs), providing insight into how to select effective ATB tiling factors. As a case study, we apply ATB to AMD's latest XDNA2 AI Engine (AIE), achieving up to a 4.54x speedup, from 4.8 to 24.6 TFLOPS on mixed-precision BFP16-BF16 GEMM, establishing a new performance record for XDNA2 AIE.
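
The trade-off named in the abstract (higher arithmetic intensity against more kernel switches) can be illustrated with a toy calculation. The sketch below is not the paper's performance model. It assumes a simplified setting in which local memory holds one A tile of `m_a × k`, one B tile of `k × n`, and one C tile of `m_c × n`, with `m_c` an integer multiple of `m_a`, and it ignores double buffering. The function name `tile_stats`, the tile sizes, and the per-element byte counts are all hypothetical placeholders chosen for illustration.

```python
def tile_stats(m_a, m_c, k, n, K, bytes_a=1.0, bytes_b=1.0, bytes_c=2.0):
    """Toy per-output-tile statistics (illustrative assumptions, not the paper's model).

    m_a : rows of the buffered A tile
    m_c : rows of the buffered C tile (symmetric buffering means m_c == m_a)
    k, n: inner and column tile sizes; K is the full reduction length
    bytes_*: placeholder per-element sizes, not measured XDNA2 values
    """
    if m_c % m_a != 0 or K % k != 0:
        raise ValueError("m_c must be a multiple of m_a, and K a multiple of k")
    r = m_c // m_a        # A tiles consumed per C tile in each k-step
    k_steps = K // k      # k-steps needed to finish one C tile

    flops = 2 * m_c * K * n  # one multiply and one add per (row, k, col) triple
    # Off-chip traffic for one finished C tile: A rows, B columns, C written once.
    traffic = m_c * K * bytes_a + K * n * bytes_b + m_c * n * bytes_c
    # Local buffer footprint: a small A tile leaves room for a taller C tile.
    footprint = m_a * k * bytes_a + k * n * bytes_b + m_c * n * bytes_c

    return {
        "intensity": flops / traffic,   # FLOPs per byte moved
        "kernel_calls": r * k_steps,    # stand-in for kernel-switching cost
        "footprint_bytes": footprint,
    }


# Two configurations with the same local footprint under the placeholder sizes.
sym = tile_stats(m_a=32, m_c=32, k=64, n=32, K=512)   # symmetric: m_a == m_c
asym = tile_stats(m_a=16, m_c=48, k=64, n=32, K=512)  # asymmetric: m_c = 3 * m_a

for name, stats in (("symmetric", sym), ("asymmetric", asym)):
    print(name, stats)
```

Under these placeholder numbers both configurations occupy the same buffer space, the asymmetric one reaches a higher arithmetic intensity because each B tile is reused across more output rows, and it issues three times as many kernel calls. A model of the kind the abstract describes would have to weigh exactly these two effects when choosing tiling factors.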

Problem

Research questions and friction points this paper is trying to address.

Optimizing asymmetric tile buffering for GEMM efficiency

Modeling performance trade-offs in tile dimension decoupling

Achieving speedup on the AMD XDNA2 AI Engine architecture

Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric tile buffering decouples input and output operand dimensions.

Performance model balances arithmetic intensity and switching costs.

Applied to the AMD XDNA2 AI Engine, achieving up to a 4.54x speedup.
