🤖 AI Summary
Existing LLM deployment approaches treat weight compression and activation sparsity separately, leading to high inference overhead. Method: This paper proposes a training-free dual-sparse inference framework that unifies them by modeling runtime activation sparsity as dynamic structured weight sparsity. It introduces activation-aware calibration and output residual correction to compensate for the accuracy loss incurred by unstructured pruning. Built upon the Optimal Brain Compression framework, the method integrates activation-driven pruning, residual correction, and GPU-aware execution optimization. Contribution/Results: Evaluated on LLaMA-2 and LLaMA-3, the framework achieves up to 9.17% higher accuracy than state-of-the-art structured pruning methods at an iso-speedup of 1.39× over the dense baseline, enabling efficient deployment of billion-parameter models.
📝 Abstract
Large language models (LLMs) deliver strong performance but are difficult to deploy due to high memory and compute costs. While pruning reduces these demands, most methods ignore the activation sparsity observed at runtime. We reinterpret activation sparsity as dynamic structured weight sparsity and propose DuoGPT, a unified framework that constructs dual-sparse (spMspV) workloads by combining unstructured weight pruning with activation sparsity. To preserve accuracy, we extend the Optimal Brain Compression (OBC) framework with activation-aware calibration and introduce output residuals from the dense model as correction terms. We further optimize the solution for efficient GPU execution, enabling scalability to billion-parameter LLMs. Evaluations on LLaMA-2 and LLaMA-3 show that DuoGPT outperforms state-of-the-art structured pruning methods by up to 9.17% in accuracy at an iso-speedup of 1.39× compared to the dense baseline.
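The dual-sparse (spMspV) idea described above can be illustrated with a minimal sketch: zero activations let the kernel skip entire weight columns (the "dynamic structured" part), while unstructured pruning zeroes individual weights within the remaining columns. The function name, the dense `W` representation, and the `act_threshold` parameter are illustrative assumptions, not the paper's actual GPU kernel.

```python
import numpy as np

def dual_sparse_matvec(W, x, act_threshold=0.0):
    """Illustrative dual-sparse (spMspV) matrix-vector product.

    W: weight matrix with unstructured sparsity (zeros = pruned weights).
    x: activation vector; entries with |x_j| <= act_threshold are treated
       as zero, so their whole weight columns are skipped. This models
       runtime activation sparsity as dynamic structured weight sparsity
       layered on top of the static unstructured pruning.
    """
    active = np.flatnonzero(np.abs(x) > act_threshold)  # surviving activations
    y = np.zeros(W.shape[0], dtype=W.dtype)
    for j in active:
        col = W[:, j]
        nz = np.flatnonzero(col)      # skip individually pruned weights
        y[nz] += col[nz] * x[j]
    return y
```

With `act_threshold=0.0` this reproduces the dense product `W @ x` exactly, since skipped columns contribute nothing; a real implementation would use compressed sparse formats and fused GPU kernels rather than this column loop.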