🤖 AI Summary
To address the low execution efficiency of sparse deep learning models on reconfigurable dataflow architectures (RDAs), this paper proposes the first fusion-centric compilation framework for PyTorch-based sparse models. The method automatically transforms sparse computation graphs into fusible dataflow representations, enabling the first general cross-expression sparse fusion and revealing a new principle: optimal fusion granularity must adapt to model-specific sparsity patterns. A heuristic design-space pruning strategy overcomes the cost of exhaustive fusion exploration. Integrated compiler optimizations (parallelization, sparse tiling, dataflow reordering, and inter-kernel fusion) are evaluated via a cycle-accurate dataflow simulator for microarchitectural analysis. On four real-world sparse models, including GPT-3 with BigBird block-sparse attention, the framework achieves up to a ~2.7x speedup over an unfused baseline, significantly improving the hardware mapping efficiency of sparse DNNs on RDAs.
📝 Abstract
As deep learning models scale, sparse computation and specialized dataflow hardware have emerged as powerful solutions to improve efficiency. We propose FuseFlow, a compiler that converts sparse machine learning models written in PyTorch into fused sparse dataflow graphs for reconfigurable dataflow architectures (RDAs). FuseFlow is the first compiler to support general cross-expression fusion of sparse operations. In addition to fusion across kernels (expressions), FuseFlow also supports optimizations such as parallelization, dataflow ordering, and sparsity blocking. It targets a cycle-accurate dataflow simulator for microarchitectural analysis of fusion strategies. We use FuseFlow for design-space exploration across four real-world machine learning applications with sparsity, showing that full fusion (cross-expression fusion across all computation in an end-to-end model) is not always optimal for sparse models; the best fusion granularity depends on the model itself. FuseFlow also provides a heuristic to identify and prune suboptimal configurations. Using FuseFlow, we achieve performance improvements, including a ~2.7x speedup over an unfused baseline for GPT-3 with BigBird block-sparse attention.
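To make the idea of cross-expression fusion concrete, here is a minimal illustrative sketch (not FuseFlow's actual API or IR): two expressions over a CSR sparse matrix, first unfused (materializing the intermediate of expression 1 before expression 2 consumes it), then fused into a single pass over the nonzeros. The function names, the CSR layout, and the toy expressions are all hypothetical and chosen only to illustrate why fusion avoids intermediate buffers.

```python
# Hypothetical sketch of cross-expression fusion over sparse data.
# Expression 1: scale every nonzero. Expression 2: sum each row.

def unfused(values, row_ptr, scale):
    # Expression 1 runs to completion, materializing an intermediate list.
    scaled = [v * scale for v in values]
    # Expression 2 then reduces each row of the intermediate.
    return [sum(scaled[row_ptr[r]:row_ptr[r + 1]])
            for r in range(len(row_ptr) - 1)]

def fused(values, row_ptr, scale):
    # Both expressions execute in one streaming pass over the nonzeros,
    # so no intermediate buffer is ever allocated.
    out = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for i in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[i] * scale
        out.append(acc)
    return out

# A 3-row sparse matrix in CSR form; row 1 has no nonzeros.
values = [1.0, 2.0, 3.0, 4.0]
row_ptr = [0, 2, 2, 4]
print(fused(values, row_ptr, 2.0))  # matches unfused(values, row_ptr, 2.0)
```

A fusion-centric compiler generalizes this transformation across arbitrary chains of sparse expressions, where the paper's key finding applies: whether fusing everything (as in `fused` above) beats partial fusion depends on the model's sparsity structure.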