Hexcute: A Tile-based Programming Language with Automatic Layout and Task-Mapping Synthesis

📅 2025-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Optimizing mixed-precision deep learning operators, such as mixed-type matrix multiplication, on GPUs remains challenging: high-level compilers (e.g., Triton) lack the expressiveness needed for fine-grained hardware control, while low-level programming models (e.g., CUTLASS) demand significant development effort. To bridge this gap, we propose Hexcute, a tile-based programming language. Its core contribution is a novel type-inference-based algorithm that automatically synthesizes data layouts and task mappings. The language explicitly exposes shared memory and register abstractions, which enables fine-grained data pipelines and hardware-friendly memory layouts. Experiments show that Hexcute achieves a 1.7×–11.28× speedup over existing deep learning compilers on mixed-type operators, delivers up to a 2.91× end-to-end speedup, and generalizes to a wide range of deep learning operators.

📝 Abstract
Deep learning (DL) workloads mainly run on accelerators like GPUs. Recent DL quantization techniques demand a new matrix multiplication operator with mixed input data types, further complicating GPU optimization. Prior high-level compilers like Triton lack the expressiveness to implement key optimizations like fine-grained data pipelines and hardware-friendly memory layouts for these operators, while low-level programming models, such as Hidet, Graphene, and CUTLASS, require significant programming efforts. To balance expressiveness with engineering effort, we propose Hexcute, a tile-based programming language that exposes shared memory and register abstractions to enable fine-grained optimization for these operators. Additionally, Hexcute leverages task mapping to schedule the GPU program, and to reduce programming efforts, it automates layout and task mapping synthesis with a novel type-inference-based algorithm. Our evaluation shows that Hexcute generalizes to a wide range of DL operators, achieves 1.7-11.28$\times$ speedup over existing DL compilers for mixed-type operators, and brings up to 2.91$\times$ speedup in the end-to-end evaluation.
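
To make "matrix multiplication with mixed input data types" concrete, here is a minimal CPU reference sketch in plain Python. It assumes one common quantization scheme (floating-point activations, signed 4-bit weights packed two per byte, one scale per output column); the abstract does not specify a scheme, and the function names `unpack_int4` and `mixed_matmul` are hypothetical and unrelated to Hexcute's API.

```python
# Illustrative only: a reference (non-GPU) mixed-type matmul, C = A @ dequant(W).
# The packing format and the per-column scales are assumptions for this sketch.

def unpack_int4(byte):
    """Split one byte into two signed 4-bit values (low nibble first)."""
    lo, hi = byte & 0x0F, (byte >> 4) & 0x0F
    # Map the unsigned nibble range 0..15 onto the signed range -8..7.
    return (lo - 16 if lo >= 8 else lo, hi - 16 if hi >= 8 else hi)


def mixed_matmul(a, w_packed, scales):
    """a: M x K floats; w_packed: (K/2) x N bytes; scales: N floats."""
    m, k, n = len(a), len(a[0]), len(scales)
    # Packed row r holds rows 2r and 2r+1 of the logical K x N weight matrix.
    w = [[0] * n for _ in range(k)]
    for r, packed_row in enumerate(w_packed):
        for j, byte in enumerate(packed_row):
            w[2 * r][j], w[2 * r + 1][j] = unpack_int4(byte)
    # Dequantize each weight on the fly and accumulate in floating point.
    return [[sum(a[i][p] * (w[p][j] * scales[j]) for p in range(k))
             for j in range(n)] for i in range(m)]


# One activation row against a 2 x 2 weight matrix stored in two bytes.
result = mixed_matmul([[1.0, 2.0]], [[0x21, 0xF3]], [0.5, 2.0])
```

On a GPU, the unpacking and dequantization steps are what make this operator hard to optimize: they have to be interleaved with memory loads and tensor-core instructions, which is where fine-grained pipelines and memory layouts matter.
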
Problem

Research questions and friction points this paper is trying to address.

Addresses mixed-type matrix multiplication in DL workloads
Balances expressiveness and effort in GPU optimization
Automates layout and task mapping for DL operators
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tile-based language for fine-grained optimization
Automates layout and task mapping synthesis
Type-inference algorithm reduces programming effort
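
The two terms "layout" and "task mapping" can be illustrated with a small Python sketch. It assumes the usual reading of these terms: a layout maps a tile coordinate to a memory offset through a shape and strides, and a task mapping assigns tile elements to threads. The helpers `make_layout` and `task_mapping` are hypothetical and are not Hexcute's API or its synthesis algorithm.

```python
# Illustrative only: hypothetical helpers, not Hexcute's API.

def make_layout(shape, stride):
    """Return a function mapping a (row, col) coordinate to a linear offset."""
    def offset(row, col):
        assert 0 <= row < shape[0] and 0 <= col < shape[1]
        return row * stride[0] + col * stride[1]
    return offset


def task_mapping(tile_shape, num_threads):
    """Assign each tile element to a thread, round-robin in row-major order."""
    rows, cols = tile_shape
    tasks = {t: [] for t in range(num_threads)}
    for row in range(rows):
        for col in range(cols):
            tasks[(row * cols + col) % num_threads].append((row, col))
    return tasks


# Two layouts for the same 4 x 8 tile.
row_major = make_layout((4, 8), (8, 1))
col_major = make_layout((4, 8), (1, 4))

# With 8 threads, thread t owns column t of the tile.
mapping = task_mapping((4, 8), num_threads=8)

# The same thread touches different addresses depending on the layout.
thread3_row_major = [row_major(r, c) for r, c in mapping[3]]
thread3_col_major = [col_major(r, c) for r, c in mapping[3]]
```

Under the row-major layout, thread 3 reads addresses that are 8 elements apart, while under the column-major layout its addresses are contiguous. Choosing a layout and a task mapping that fit together is the kind of decision that, according to the abstract, Hexcute synthesizes automatically instead of leaving to the programmer.
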