Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs

📅 2025-12-19

🤖 AI Summary

Existing GPU compilers rely on fragile heuristics and manual expertise to combine software pipelining (SWP) and warp specialization (WS), which hinders systematic exploration of the schedule space. This paper formulates SWP and WS as a joint constrained optimization problem and introduces Twill, the first system that automatically derives optimal SWP and WS schedules for a large class of iterative programs. Twill is heuristic-free and easily extensible to new GPU architectures. Its core innovation is an exact scheduling model grounded in GPU instruction-level semantics, warp execution semantics, and Tensor Core hardware constraints, combined with iterative program abstraction and constraint solving to generate end-to-end optimal schedules. On the NVIDIA Hopper and Blackwell architectures, Twill rediscovers the expert hand-tuned Flash Attention schedules and thereby proves them optimal.

📝 Abstract

GPU architectures have continued to grow in complexity, with recent incarnations introducing increasingly powerful fixed-function units for matrix multiplication and data movement to accompany highly parallel general-purpose cores. To fully leverage these machines, software must use sophisticated schedules that maximally utilize all hardware resources. Since realizing such schedules is complex, both programmers and compilers routinely employ program transformations, such as software pipelining (SWP) and warp specialization (WS), to do so in practice. However, determining how best to use SWP and WS in combination is a challenging problem that is currently handled through a mix of brittle compilation heuristics and fallible human intuition, with little insight into the space of solutions. To remedy this situation, we introduce a novel formulation of SWP and WS as a joint optimization problem that can be solved holistically by off-the-shelf constraint solvers. We reify our approach in Twill, the first system that automatically derives optimal SWP and WS schedules for a large class of iterative programs. Twill is heuristic-free, easily extensible to new GPU architectures, and guaranteed to produce optimal schedules. We show that Twill can rediscover, and thereby prove optimal, the SWP and WS schedules manually developed by experts for Flash Attention on both the NVIDIA Hopper and Blackwell GPU architectures.
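To make the idea of scheduling as constraint solving concrete, the following is a minimal sketch of classic modulo scheduling posed as a feasibility search. It is not Twill's formulation: the loop body (`OPS`), the dependences (`DEPS`), the unit names, and all cycle counts are hypothetical. It uses exhaustive enumeration in place of an off-the-shelf constraint solver, and it leaves warp specialization unmodeled. It searches for the smallest initiation interval (II), the number of cycles between the starts of consecutive loop iterations, for which every dependence and every hardware-unit reservation can be satisfied.

```python
from itertools import product

# Hypothetical toy loop body: op name -> (hardware unit, cycles the unit is busy).
# For simplicity, an op's result is assumed ready once its busy cycles elapse.
OPS = {
    "load": ("tma", 2),
    "mma": ("tensor", 4),
    "scale": ("cuda", 2),
    "store": ("tma", 1),
}

# Hypothetical dependences: (producer, consumer, iteration distance).
# Distance 1 means the consumer in iteration i+1 needs the producer of iteration i.
DEPS = [
    ("load", "mma", 0),
    ("mma", "scale", 0),
    ("scale", "store", 0),
    ("scale", "mma", 1),  # loop-carried dependence
]


def feasible(starts, ii):
    """Check one candidate schedule against dependence and resource constraints."""
    # Dependence constraints: the consumer starts only after the producer finishes,
    # with loop-carried edges relaxed by ii cycles per iteration of distance.
    for src, dst, dist in DEPS:
        if starts[dst] + ii * dist < starts[src] + OPS[src][1]:
            return False
    # Modulo reservation table: each unit serves at most one op per cycle modulo ii.
    taken = set()
    for name, (unit, busy) in OPS.items():
        if busy > ii:
            return False
        for k in range(busy):
            slot = (unit, (starts[name] + k) % ii)
            if slot in taken:
                return False
            taken.add(slot)
    return True


def min_initiation_interval(max_ii=16, horizon=12):
    """Return (ii, starts) for the smallest feasible ii, searching starts < horizon."""
    names = list(OPS)
    for ii in range(1, max_ii + 1):
        for combo in product(range(horizon), repeat=len(names)):
            starts = dict(zip(names, combo))
            if feasible(starts, ii):
                return ii, starts
    return None


if __name__ == "__main__":
    print(min_initiation_interval())
```

In this toy instance the loop-carried cycle `mma -> scale -> mma` takes 4 + 2 = 6 cycles at iteration distance 1, so no II below 6 can satisfy the dependences, and the search returns II = 6. Because the search tries every II in increasing order, the first feasible one is minimal within the searched horizon. That is a small-scale analogue of deriving a schedule from constraints rather than from heuristics.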

Problem

Research questions and friction points this paper is trying to address.

Optimizing software pipelining and warp specialization jointly for GPUs

Automating schedule generation to replace manual heuristics and intuition

Ensuring optimal performance for iterative programs on Tensor Core GPUs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint optimization of software pipelining and warp specialization

Heuristic-free scheduling using constraint solvers

Automatic generation of optimal GPU schedules for iterative programs
