🤖 AI Summary
Deep learning domain-specific languages (DSLs) such as Triton require expertise in parallel programming and expose low-level hardware details, leading to high kernel development and maintenance costs. To address this, we propose NineToothed, a high-level DSL for machine learning that supports serial programming while automatically generating efficient parallel code. Our key contributions are: (1) tensor-oriented metaprogramming (TOM), enabling abstract, tile-wise computation specification; (2) an arrange-and-apply paradigm that decouples algorithmic logic from hardware-specific scheduling; and (3) a fully automated serial-to-parallel translation framework with a high-performance code generator. Evaluation shows that NineToothed achieves near-Triton performance, with average overhead under 5%, while reducing development complexity and improving maintainability and programmer productivity.
📝 Abstract
The emergence of deep learning domain-specific languages (DSLs) has substantially lowered the barriers to developing high-performance, cross-platform compute kernels. However, current DSLs, such as Triton, still demand that developers possess expertise in parallel programming and expose them to many low-level details. This requirement complicates development and makes compute kernels harder to maintain. Consequently, a new programming model that supports serial programming for deep learning workloads is needed.
This paper introduces NineToothed, a domain-specific language that offers serial semantics for machine learning programming. By automatically transforming serial code into parallel code, NineToothed significantly streamlines the development process while incurring minimal performance degradation. NineToothed encompasses (1) a language with tensor-oriented metaprogramming (TOM) that adopts the arrange-and-apply paradigm, enabling the expression of tiled computations without managing low-level details, and (2) a code generator that produces high-performance parallel code. Our evaluation results indicate that NineToothed can greatly simplify compute kernel development while maintaining performance comparable to that of Triton.
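The arrange-and-apply idea described in the abstract can be sketched in plain Python. This is only a conceptual illustration, not NineToothed code: the names `arrange`, `application`, `launch`, and `BLOCK_SIZE` are hypothetical, and the loop in `launch` merely stands in for the parallel code that a generator would emit. The point is that the tiling is stated once in the arrangement, while the per-tile logic is written serially, with no program indices, offsets, or bounds masks.

```python
# Conceptual sketch only. `arrange`, `application`, `launch`, and BLOCK_SIZE
# are hypothetical names chosen for illustration; they are NOT the
# NineToothed API.

BLOCK_SIZE = 4  # assumed tile size


def arrange(x, y, block_size=BLOCK_SIZE):
    """Arrange step: split both inputs into tiles and pair matching tiles."""
    def tile(v):
        return [v[i:i + block_size] for i in range(0, len(v), block_size)]

    return list(zip(tile(x), tile(y)))


def application(x_tile, y_tile):
    """Apply step: serial logic for a single tile (element-wise addition)."""
    return [a + b for a, b in zip(x_tile, y_tile)]


def launch(arrangement, application, *tensors):
    """Stand-in for generated parallel code.

    Each loop iteration is independent of the others, so each one could be
    mapped to a separate parallel program.
    """
    out = []
    for tiles in arrangement(*tensors):
        out.extend(application(*tiles))
    return out


x = list(range(10))
y = [10 * v for v in x]
result = launch(arrange, application, x, y)
```

Here the input length (10) is not a multiple of the tile size, so the last tile is shorter; the arrangement absorbs that detail and `application` stays unchanged, which is the separation of concerns the paradigm aims for.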