AI Summary
To address the insufficient co-optimization of compression formats and dataflows in sparse large language model (LLM) accelerator design, this paper proposes SnipSnap, a unified optimization framework. Its key contributions are: (1) a hierarchical compression format encoding scheme enabling fine-grained format modeling; (2) an adaptive compression engine that dynamically matches compression formats to input sparsity patterns; and (3) a progressive co-search methodology that jointly optimizes compression formats and dataflows within a unified design space. Experimental evaluation demonstrates that SnipSnap reduces memory energy consumption by 18.24% on average and achieves search speedups of 2248.3× and 21.0× over the Sparseloop and DiMO-Sparse frameworks, respectively. These improvements enhance both the energy efficiency of the resulting sparse LLM accelerators and the speed of the design space exploration itself.
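To make the idea of format selection under varying sparsity concrete, here is a toy sketch. It is not SnipSnap's actual engine; the formats (dense, bitmap, CSR-like), bit widths, and cost model are illustrative assumptions chosen only to show how the cheapest encoding for a weight tile shifts with its nonzero count.

```python
# Illustrative sketch (NOT SnipSnap's algorithm): pick a compression
# format for a weight tile by estimated storage cost. Assumes 16-bit
# values and 32-bit indices/pointers; real cost models also account
# for decode hardware and dataflow-dependent access patterns.

def estimated_bits(fmt, rows, cols, nnz):
    """Rough storage cost of one tile, in bits, for a given format."""
    if fmt == "dense":
        return rows * cols * 16                   # every element stored
    if fmt == "bitmap":
        return rows * cols * 1 + nnz * 16         # 1-bit mask + nonzero values
    if fmt == "csr":
        return (rows + 1) * 32 + nnz * (32 + 16)  # row ptrs + col idx + values
    raise ValueError(f"unknown format: {fmt}")

def pick_format(rows, cols, nnz):
    """Choose the cheapest format for a tile with the given nonzero count."""
    return min(("dense", "bitmap", "csr"),
               key=lambda f: estimated_bits(f, rows, cols, nnz))

# Moderately sparse tiles favor bitmap; very sparse tiles favor CSR.
print(pick_format(64, 64, 2048))  # 50% nonzeros -> bitmap
print(pick_format(64, 64, 32))    # ~0.8% nonzeros -> csr
```

Even this crude model shows why a single fixed format is suboptimal across layers with different sparsity levels, which is the motivation for an adaptive engine that matches formats to input sparsity patterns.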
Abstract
The growing scale of large language models (LLMs) has intensified demands on computation and memory, making efficient inference a key challenge. While sparsity can reduce these costs, existing design space exploration (DSE) frameworks often overlook compression formats, a key factor for leveraging sparsity on accelerators. This paper proposes SnipSnap, a joint compression format and dataflow co-optimization framework for efficient sparse LLM accelerator design. SnipSnap introduces: (1) a hierarchical compression format encoding to expand the design space; (2) an adaptive compression engine for selecting formats under diverse sparsity; and (3) a progressive co-search workflow that jointly optimizes dataflow and compression formats. SnipSnap achieves 18.24% average memory energy savings via format optimization, along with 2248.3× and 21.0× search speedups over the Sparseloop and DiMO-Sparse frameworks, respectively.