Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning

📅 2025-04-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Efficiently compressing hybrid Attention+SSM large language models remains challenging because of the intricate interplay between the attention and state-space model (SSM) components. Method: the paper proposes a group-aware structured pruning technique for SSM blocks (the first systematic investigation of the role SSM modules play in hybrid-architecture compression), combined with multi-granularity pruning across feed-forward networks (FFNs), embedding dimensions, and layers, followed by MINITRON-style knowledge-distillation retraining. Contribution/Results: the method preserves sequence-modeling capability and architectural integrity while compressing the Nemotron-H 8B hybrid model to 4B parameters, using up to 40x fewer training tokens and achieving 2x faster inference. Crucially, it surpasses the accuracy of similarly sized baselines, significantly advancing the accuracy-latency Pareto frontier.
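The core idea of group-aware SSM pruning is that SSM heads are organized into groups that share group-level projections, so pruning must remove the same number of heads from every group to keep the block structurally valid. A minimal sketch under assumed conventions (heads laid out contiguously by group, one importance score per head; the paper's actual scoring and layout may differ):

```python
def group_aware_head_prune(head_scores, n_groups, keep_per_group):
    """Select which SSM heads to keep, pruning uniformly within each group.

    head_scores: per-head importance scores, heads laid out contiguously
                 by group (hypothetical layout assumption).
    Returns sorted indices of the kept heads. Keeping the same number of
    heads in every group preserves the group structure that shared
    group-level projections rely on.
    """
    n_heads = len(head_scores)
    assert n_heads % n_groups == 0, "heads must divide evenly into groups"
    heads_per_group = n_heads // n_groups
    kept = []
    for g in range(n_groups):
        start = g * heads_per_group
        group = list(range(start, start + heads_per_group))
        # rank heads within this group by importance, highest first
        group.sort(key=lambda h: head_scores[h], reverse=True)
        # keep exactly keep_per_group heads per group
        kept.extend(sorted(group[:keep_per_group]))
    return kept
```

A naive global top-k over all heads could empty out one group entirely; the per-group selection above is what makes the pruning "group-aware."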

📝 Abstract
Hybrid LLM architectures that combine Attention and State Space Models (SSMs) achieve state-of-the-art accuracy and runtime performance. Recent work has demonstrated that applying compression and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost. In this work, we explore the effectiveness of compressing Hybrid architectures. We introduce a novel group-aware pruning strategy that preserves the structural integrity of SSM blocks and their sequence modeling capabilities. Furthermore, we demonstrate the necessity of such SSM pruning to achieve improved accuracy and inference speed compared to traditional approaches. Our compression recipe combines SSM, FFN, embedding dimension, and layer pruning, followed by knowledge distillation-based retraining, similar to the MINITRON technique. Using this approach, we compress the Nemotron-H 8B Hybrid model down to 4B parameters with up to 40x fewer training tokens. The resulting model surpasses the accuracy of similarly-sized models while achieving 2x faster inference, significantly advancing the Pareto frontier.
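The MINITRON-style retraining mentioned in the abstract distills the original 8B teacher into the pruned 4B student. A minimal per-token sketch of a forward-KL distillation loss on output logits, assuming plain Python lists (the recipe's exact loss weighting and temperature are not specified here):

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_kl(teacher_logits, student_logits, temperature=1.0):
    """Forward KL(teacher || student) over the output vocabulary for one
    token: the quantity knowledge-distillation retraining drives toward
    zero (a sketch; real training averages over tokens and batches)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, which is why distillation needs far fewer tokens than pretraining from scratch.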
Problem

Research questions and friction points this paper is trying to address.

- Compressing hybrid Attention-SSM LLMs efficiently
- Preserving SSM block integrity during pruning
- Achieving faster inference with higher accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Group-aware pruning for SSM structural integrity
- Combined SSM, FFN, embedding, and layer pruning
- Knowledge distillation-based retraining for efficiency
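Structured pruning of the kind listed above needs an importance score for each prunable unit. One common choice in the MINITRON line of work is activation-based importance, scoring FFN hidden neurons by their average activation magnitude over a calibration set. A sketch under that assumption (data layout and thresholding are illustrative, not the paper's exact procedure):

```python
def ffn_neuron_importance(activations):
    """Score each FFN hidden neuron by its mean absolute activation
    over a calibration batch (activation-based importance).

    activations: list of per-token activation vectors, each of
                 length hidden_dim.
    """
    n_tokens = len(activations)
    hidden_dim = len(activations[0])
    scores = [0.0] * hidden_dim
    for vec in activations:
        for i, a in enumerate(vec):
            scores[i] += abs(a)
    return [s / n_tokens for s in scores]

def prune_ffn_neurons(activations, keep):
    """Return indices of the `keep` highest-scoring neurons, in
    original order, so the corresponding weight rows/columns can be
    sliced out of the FFN projections."""
    scores = ffn_neuron_importance(activations)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

The same rank-and-slice pattern applies at coarser granularities (embedding channels, whole layers), with the score computed over the appropriate unit.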
👥 Authors
Ali Taghibakhshi
Deep Learning Algorithm Engineer, NVIDIA
Scientific Computing, Machine Learning, Graph Neural Networks, Reinforcement Learning
Sharath Turuvekere Sreenivas
Saurav Muralidharan
NVIDIA
Efficient Deep Learning, Large Language Models
Marcin Chochowski
NVIDIA, previously Samsung R&D Poland
NLP, Deep Learning, Biometrics
Yashaswi Karnati
Raviraj Joshi
Indian Institute of Technology Madras
Computer Science, Machine Learning, Natural Language Processing
Ameya Sunil Mahabaleshwarkar
Deep Learning Scientist, NVIDIA
Deep Learning, Natural Language Processing, Large Language Models, Small Language Models
Zijia Chen
Senior Deep Learning Scientist, NVIDIA Corporation
Natural Language Processing, Artificial Intelligence, Multimodal Models
Yoshi Suhara
NVIDIA
Natural Language Processing, Machine Learning, Computational Social Science
Oluwatobi Olabiyi
Daniel Korzekwa
NVIDIA
Pruning, Distillation, LLM, VLM, Speech
Mostofa Patwary
Director, Applied Deep Learning Research, NVIDIA
Natural Language Processing, Large Scale Deep Learning, High Performance Computing, Parallel
Mohammad Shoeybi
Senior Director of Applied Research at NVIDIA
Large Language Models, NLP, Multi-Modal Models, Generative AI
Jan Kautz
Vice President of Research, NVIDIA Research
Computer Vision, Machine Learning, Visual Computing
Bryan Catanzaro
NVIDIA
Parallel Computing, Machine Learning
Ashwath Aithal
Nima Tajbakhsh
NVIDIA Inc.
Computer Vision and Artificial Intelligence
Pavlo Molchanov
NVIDIA Research
AI, Machine Learning, Efficient Deep Learning, Semi-supervised Learning, Network Inversion