🤖 AI Summary
This paper addresses the challenge of efficiently and portably deploying streaming-data algorithms on distributed and systolic architectures. To this end, the authors propose Cyclotron, a novel compilation framework. Methodologically, Cyclotron introduces recurrence equations as a high-level abstraction for uniformly expressing streaming computation; employs an Iteration Space Intermediate Representation (ISIR) to explicitly model data locality and communication patterns, thereby decoupling scheduling policies from data layout; and pairs a language of recurrences over tensors with a custom scheduling language to generate per-processor send, receive, and compute instructions. Contributions include end-to-end compilation targeting systolic arrays, a chiplet simulator, and distributed CPU clusters. Experimental results show that Cyclotron achieves performance on par with ScaLAPACK for matrix multiplication and triangular solve, while significantly improving memory-access locality and pipelined execution efficiency.
📝 Abstract
We present Cyclotron, a framework and compiler for using recurrence equations to express streaming dataflow algorithms, which are then portably compiled to distributed topologies of interlinked processors. Our framework provides an input language of recurrences over logical tensors, which is lowered into an intermediate language of recurrences over logical iteration spaces, and finally into programs of send, receive, and computation operations specific to each individual processor. In Cyclotron's IR, programs are optimized such that external memory interactions are confined to the boundaries of the iteration space. Within inner iteration spaces, all accesses are local, targeting values residing in fast local memory or on neighboring processing units and thereby avoiding costly memory movement. We provide a scheduling language allowing users to define how data is streamed and broadcast between processors, enabling pipelined execution of computation kernels over distributed topologies of processing elements. We demonstrate the portability of our approach by compiling our IR to a reconfigurable simulator of systolic arrays and chiplet-style distributed hardware, as well as to distributed-memory CPU clusters. In the simulated reconfigurable setting, we use our compiler for hardware design-space exploration in which link costs and latencies can be specified. In the distributed CPU setting, we show how to use recurrences and our scheduling language to express various matrix multiplication routines (Cannon, SUMMA, PUMMA, weight stationary) and solvers (triangular solve and Cholesky). For matrix multiplication and triangular solve, we generate distributed implementations competitive with ScaLAPACK.
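To make the recurrence-equation viewpoint concrete, the sketch below (illustrative only; not Cyclotron's actual input syntax, and the function name `matmul_recurrence` is invented here) writes matrix multiplication as a recurrence over a logical iteration space: each partial result at step k depends only on the previous step's value plus one locally available product, which is what makes the computation streamable across a processor network.

```python
# Matrix multiplication expressed as a recurrence, in the style the
# paper describes (plain Python sketch, not Cyclotron code):
#
#   C^(0)[i][j] = 0
#   C^(k)[i][j] = C^(k-1)[i][j] + A[i][k-1] * B[k-1][j]
#
# The final result is C^(p) where p is the shared reduction dimension.

def matmul_recurrence(A, B):
    n, p, m = len(A), len(B), len(B[0])
    # Base case of the recurrence: C^(0) = 0.
    C = [[0.0] * m for _ in range(n)]
    # Each step k updates C using only the previous step's value and
    # one element each of A and B -- no global memory traffic inside
    # the iteration space.
    for k in range(p):
        for i in range(n):
            for j in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C

print(matmul_recurrence([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

In a distributed schedule such as Cannon's algorithm, the k-loop becomes the streaming dimension: the A and B operands circulate between neighboring processors while each processor accumulates its local C tile step by step.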