FuseFlow: A Fusion-Centric Compilation Framework for Sparse Deep Learning on Streaming Dataflow

📅 2025-11-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low execution efficiency of sparse deep learning models on reconfigurable dataflow architectures (RDAs), this paper proposes FuseFlow, the first fusion-centric compilation framework tailored to PyTorch-based sparse models. FuseFlow automatically transforms sparse computation graphs into fusible dataflow representations, enabling the first general cross-expression sparse fusion and uncovering a new principle: optimal fusion granularity must adapt to model-specific sparsity patterns. A heuristic design-space pruning strategy overcomes the limitations of exhaustive fusion exploration. Integrated compiler optimizations, including parallelization, sparse tiling, dataflow reordering, and inter-kernel fusion, are evaluated via cycle-accurate microarchitectural analysis in a dataflow simulator. Evaluated on four real-world sparse models, including GPT-3 augmented with BigBird block-sparse attention, FuseFlow achieves up to a ~2.7x speedup over an unfused baseline, significantly improving the hardware-mapping efficiency of sparse DNNs on RDAs.

📝 Abstract
As deep learning models scale, sparse computation and specialized dataflow hardware have emerged as powerful solutions to address efficiency. We propose FuseFlow, a compiler that converts sparse machine learning models written in PyTorch to fused sparse dataflow graphs for reconfigurable dataflow architectures (RDAs). FuseFlow is the first compiler to support general cross-expression fusion of sparse operations. In addition to fusion across kernels (expressions), FuseFlow also supports optimizations like parallelization, dataflow ordering, and sparsity blocking. It targets a cycle-accurate dataflow simulator for microarchitectural analysis of fusion strategies. We use FuseFlow for design-space exploration across four real-world machine learning applications with sparsity, showing that full fusion (entire cross-expression fusion across all computation in an end-to-end model) is not always optimal for sparse models: fusion granularity depends on the model itself. FuseFlow also provides a heuristic to identify and prune suboptimal configurations. Using FuseFlow, we achieve performance improvements, including a ~2.7x speedup over an unfused baseline for GPT-3 with BigBird block-sparse attention.
Problem

Research questions and friction points this paper is trying to address.

Compiling sparse PyTorch models to fused dataflow graphs
Enabling cross-expression fusion optimization for sparse operations
Exploring optimal fusion granularity for sparse deep learning models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compiles PyTorch models to sparse dataflow graphs
Supports cross-expression fusion for sparse operations
Provides heuristic to prune suboptimal fusion configurations
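The core idea behind cross-expression fusion can be sketched in plain Python. This is a hypothetical illustration only, not FuseFlow's actual intermediate representation: two sparse elementwise expressions are evaluated either as separate kernels (materializing the intermediate result) or as a single fused pass over the shared nonzero coordinates.

```python
# Hypothetical sketch of cross-expression sparse fusion (not FuseFlow's
# actual IR). Sparse matrices are coordinate dicts {(row, col): value}.

def unfused(A, B, M):
    # Expression 1: elementwise product, fully materialized as Y.
    Y = {k: A[k] * B[k] for k in A if k in B}
    # Expression 2: apply sparse mask M in a second full pass over Y.
    return {k: Y[k] * M[k] for k in Y if k in M}

def fused(A, B, M):
    # Cross-expression fusion: one pass over the coordinate
    # intersection; the intermediate Y is never stored in memory.
    return {k: A[k] * B[k] * M[k] for k in A if k in B and k in M}

A = {(0, 0): 2.0, (1, 2): 3.0, (2, 1): 4.0}
B = {(0, 0): 5.0, (1, 2): 1.0}
M = {(0, 0): 1.0, (2, 1): 1.0}

assert fused(A, B, M) == unfused(A, B, M)  # {(0, 0): 10.0}
```

On a dataflow target, the fused form streams values between the two expressions instead of writing the intermediate to memory, which is the kind of benefit (and design-space choice) the paper's fusion granularity exploration is about.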