AI Summary
Modern GPU programming faces a fundamental trade-off between abstraction level and hardware control: excessive abstraction impedes performance optimization, while overly low-level approaches impose significant development burdens. This work proposes TLX, an extension to the Triton language based on a Multi-Instruction, Multi-Warp (MIMW) execution model, and the first to integrate MIMW into a high-level GPU programming framework. TLX operates at the warp-group granularity, explicitly supporting multi-warp scheduling, shared memory orchestration, asynchronous operations, and cluster-aware control flow. It preserves Triton's elegant block-level programming model while enabling efficient exploitation of native hardware features. Experimental results demonstrate that TLX kernels achieve state-of-the-art performance with substantially reduced development effort and have been successfully deployed in large-scale training and inference systems.
Abstract
Modern GPUs increasingly rely on specialized hardware units and asynchronous coordination mechanisms, so performance depends on orchestrating data movement, tensor-core computation, and synchronization rather than exposing more thread-level parallelism. This creates a programming-model tension: if too much execution structure is hidden, the compiler must catch up to new hardware mechanisms; if too much is exposed, the burden of orchestration falls back onto the programmer. We present TLX (Triton Low-level Language Extensions), built around MIMW (Multi-Instruction, Multi-Warp), which expresses orchestration at warp-group granularity while preserving Triton's productive blocked programming model for regular computation. TLX realizes this idea as an embedded extension to Triton, exposing explicit interfaces for multi-warp execution, local-memory orchestration, asynchronous operations, and cluster-aware control. Our evaluation shows that TLX supports substantial customization with limited development effort while remaining competitive with state-of-the-art implementations. TLX-authored kernels have been deployed in large-scale training and inference production systems. Our code is open sourced at https://github.com/facebookexperimental/triton.