🤖 AI Summary
Manually building efficient dataflow architectures on FPGAs requires a difficult balance between performance and resource utilization, and the problem persists even with high-level synthesis (HLS) tools. This work proposes CODO, the first end-to-end compiler capable of automatically detecting and repairing dataflow violations across multiple granularities while jointly optimizing on-chip and off-chip data movement and generating efficient schedules. By integrating dataflow compliance analysis, communication optimization, and automated scheduling, CODO achieves 1.45–4.52× latency speedups on representative compute kernels and 3.7–33.8× speedups on DNN models in synthesis results, relative to state-of-the-art frameworks. Board-level evaluations show an average 7.3× speedup for CNNs and an average 2.07× speedup for GPT-2 over state-of-the-art frameworks.
📝 Abstract
FPGAs are well-suited for dataflow architectures that process data in a streaming or pipelined manner, thus satisfying the high computational and communication demands of emerging applications. However, manually implementing an efficient dataflow architecture for large-scale applications is still challenging, even for specialists who use high-level synthesis (HLS) to simplify FPGA programming.
To address this, we introduce CODO, an automated compiler that generates feasible and efficient dataflow accelerators on FPGAs. CODO features a systematic method for detecting and eliminating both coarse-grained and fine-grained dataflow violations. Building on this, CODO optimizes both on-chip and off-chip data movement to maximize transfer efficiency. To further improve design quality, CODO performs automatic scheduling, generating high-performance dataflow accelerators with a balanced performance-resource trade-off. Synthesis results show that CODO delivers $1.45\times$ to $4.52\times$ latency speedups on typical computation kernels and $3.7\times$ to $33.8\times$ speedups on DNN models compared to state-of-the-art (SOTA) frameworks. In on-board evaluations, CODO achieves a $7.3\times$ average speedup on CNN models and a $2.07\times$ average speedup on the GPT-2 model over SOTA frameworks. The compiler is open-sourced at https://github.com/sjtu-zhao-lab/codo-artifact.