🤖 AI Summary
Optimizing compound operations—such as GEMM–Softmax and self-attention—in emerging deep neural networks (e.g., large language models) is challenging due to tight coupling between compute-intensive kernels and distributed collective communication. Traditional approaches decouple computation and communication or model only single operators, failing to capture cross-cluster latency and energy overheads holistically.
Method: This paper proposes the first framework that explicitly co-models computational dataflow and inter-cluster communication costs. It introduces a fine-grained dependency representation between GEMM and non-GEMM operations, a novel dataflow abstraction, and integrated latency and energy models for end-to-end optimization of compound operations.
Contribution/Results: By deeply embedding collective communication cost into the dataflow optimization loop, the framework achieves 1.42×–3.46× speedup over non-fused baselines across diverse accelerator configurations, while significantly reducing off-chip memory traffic and energy consumption—overcoming fundamental limitations of prior operator-isolated and communication-computation decoupled methods.
📝 Abstract
Modern machine learning accelerators are designed to efficiently execute deep neural networks (DNNs) by optimizing data movement, memory hierarchy, and compute throughput. However, emerging DNN models such as large language models and state space models increasingly rely on compound operations (structured compositions of multiple basic operations), which introduce new challenges for dataflow optimization and for minimizing off-chip memory traffic. Moreover, as model size continues to grow, deployment across spatially distributed compute clusters becomes essential, requiring frequent and complex collective communication. Existing dataflow optimization frameworks and performance models either focus on single operations or lack explicit modeling of collective communication cost, limiting their applicability to modern workloads.
To address these limitations, we propose COMET, a framework for modeling and optimizing dataflow for compound operations on machine learning accelerators. COMET introduces a novel representation that explicitly models collective communication across spatial clusters, along with latency and energy cost models that account for dependencies at both the GEMM and non-GEMM operation level within compound operations. We demonstrate COMET's capabilities to analyze and optimize dataflows for compound operations such as GEMM–Softmax, GEMM–LayerNorm, and self-attention, across both edge and cloud accelerator configurations. Our collective-aware modeling enables exploration of a broader mapping space, leading to improved performance and energy efficiency. Specifically, our optimized dataflows achieve up to 1.42× speedup for GEMM–Softmax, 3.46× for GEMM–LayerNorm, and 1.82× for self-attention compared to unfused baselines.
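The core idea behind fusing a compound operation like GEMM–Softmax is to avoid materializing the full GEMM output off-chip before the softmax reads it back. The sketch below illustrates this with NumPy; the function names and the `tile` parameter are illustrative only, and the loop stands in for on-chip tiling rather than COMET's actual cost models or mapping search.

```python
import numpy as np

def gemm_softmax_unfused(A, B):
    # Unfused baseline: the entire GEMM output S is materialized
    # (on a real accelerator, written to off-chip memory) before
    # softmax reads it back.
    S = A @ B
    S = S - S.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def gemm_softmax_fused(A, B, tile=2):
    # Fused sketch: produce the GEMM output a row-tile at a time and
    # apply softmax while the tile is still "on-chip", so the full
    # intermediate never round-trips through off-chip memory.
    M = A.shape[0]
    out = np.empty((M, B.shape[1]))
    for i in range(0, M, tile):
        S = A[i:i + tile] @ B                 # one row tile of the GEMM output
        S = S - S.max(axis=1, keepdims=True)  # per-row stabilization
        E = np.exp(S)
        out[i:i + tile] = E / E.sum(axis=1, keepdims=True)
    return out
```

Because softmax normalizes each row independently, the row-tiled fused version is numerically equivalent to the unfused one; the savings come entirely from the reduced intermediate traffic, which is the kind of benefit the reported speedups over unfused baselines reflect.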