Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models

πŸ“… 2025-01-28
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Selective Structured State Space Models (SSMs), such as Mamba, still carry computational overhead and memory footprint that hinder efficient deployment. Method: This paper introduces Mamba-Shedder, a compression approach for SSM-based models, including Mamba and its hybrids. Through a systematic study of structural sensitivity, it identifies and removes redundant components at multiple granularities, from entire modules down to finer-grained structures, reducing model size and computational cost. Contribution/Results: Mamba-Shedder achieves up to 1.4x inference speedup with minimal impact on accuracy, lowering latency and memory consumption. The implementation is open-sourced, supporting reproducible compression of efficient sequence models in the post-Transformer era.

πŸ“ Abstract
Large pre-trained models have achieved outstanding results in sequence modeling. The Transformer block and its attention mechanism have been the main drivers of the success of these models. Recently, alternative architectures, such as Selective Structured State Space Models (SSMs), have been proposed to address the inefficiencies of Transformers. This paper explores the compression of SSM-based models, particularly Mamba and its hybrids. We study the sensitivity of these models to the removal of selected components at different granularities to reduce the model size and computational overhead, thus improving their efficiency while maintaining accuracy. The proposed solutions, collectively referred to as Mamba-Shedder, achieve a speedup of up to 1.4x during inference, demonstrating that model efficiency can be improved by eliminating several redundancies with minimal impact on the overall model performance. The code is available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.
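The abstract describes removing selected components at different granularities based on how sensitive accuracy is to each removal. A minimal sketch of that idea, greedy block-level pruning under an accuracy-degradation budget, is shown below. The model, block contributions, and scoring function are toy stand-ins for illustration, not the authors' implementation.

```python
# Hypothetical sketch of sensitivity-guided structural pruning in the
# spirit of Mamba-Shedder. Each "block" carries a toy contribution to a
# proxy score; real usage would measure validation accuracy/perplexity.

def evaluate(blocks):
    """Toy proxy for model quality: sum of the blocks' contributions."""
    return sum(contribution for _, contribution in blocks)

def prune_least_sensitive(blocks, max_drop):
    """Greedily remove the block whose removal hurts the score least,
    stopping before total degradation exceeds the budget `max_drop`."""
    baseline = evaluate(blocks)
    pruned = list(blocks)
    while len(pruned) > 1:
        # Sensitivity of block i = total score drop if it is removed.
        drops = [
            baseline - evaluate(pruned[:i] + pruned[i + 1:])
            for i in range(len(pruned))
        ]
        i_min = min(range(len(pruned)), key=drops.__getitem__)
        candidate = pruned[:i_min] + pruned[i_min + 1:]
        if baseline - evaluate(candidate) > max_drop:
            break  # pruning further would exceed the accuracy budget
        pruned = candidate
    return pruned

# Example: six blocks with unequal (made-up) contributions.
model = [(f"block{i}", c)
         for i, c in enumerate([0.30, 0.01, 0.25, 0.02, 0.40, 0.02])]
kept = prune_least_sensitive(model, max_drop=0.05)
print([name for name, _ in kept])  # low-contribution blocks are shed
```

The same greedy loop generalizes to other granularities (e.g. channels or SSM sub-modules) by changing what counts as a removable unit and re-measuring sensitivity after each removal.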
Problem

Research questions and friction points this paper is trying to address.

Pretrained Models
Selective Structured State Space Models (SSMs)
Efficiency Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba-Shedder
SSM Model Optimization
Efficient Sequence Data Processing
πŸ”Ž Similar Papers
No similar papers found.