🤖 AI Summary
This work addresses the high inference overhead and bandwidth saturation in Mamba2 models caused by expanded state dimensions, which existing pruning methods struggle to mitigate effectively. The authors propose GHOST, a structured pruning framework that introduces balanced truncation—a concept from control theory—into Mamba2 pruning for the first time. By leveraging forward-pass statistics to jointly assess the controllability and observability of hidden states, GHOST enables efficient, gradient-free pruning. The method incorporates output-aware metrics, grouped hidden state selection, and structured sparsity strategies, achieving 50% compression of the state dimension across models ranging from 130M to 2.7B parameters. This results in only a ~1-point increase in WikiText-2 perplexity, nearly matching the accuracy of gradient-based approaches.
📝 Abstract
While Mamba2's expanded state dimension enhances temporal modeling, it incurs substantial inference overhead that saturates memory bandwidth during autoregressive generation. Standard pruning methods fail to address this bottleneck: unstructured sparsity leaves activations dense, magnitude-based selection ignores runtime dynamics, and gradient-based methods impose prohibitive costs. We introduce GHOST (Grouped Hidden-state Output-aware Selection and Truncation), a structured pruning framework that approximates control-theoretic balanced truncation using only forward-pass statistics. By jointly measuring controllability and observability, GHOST rivals the fidelity of gradient-based methods without requiring backpropagation. On models ranging from 130M to 2.7B parameters, our approach achieves a 50% state-dimension reduction with an increase of approximately 1 perplexity point on WikiText-2. Code is available at https://anonymous.4open.science/r/mamba2_ghost-7BCB/.
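To make the balanced-truncation idea concrete, here is a minimal sketch of how one might score and select state channels from forward-pass statistics alone. It is an illustration under assumed names, not GHOST's actual implementation: `hidden_states` stands for calibration-set activations, `C_out` for a hypothetical output projection, and the proxies below are simple stand-ins for the paper's controllability/observability measures.

```python
import numpy as np

def ghost_like_scores(hidden_states, C_out):
    """Score each state channel, balanced-truncation style.

    hidden_states: (T, N) array of states collected over a calibration run.
    C_out: (d_out, N) output projection mapping states to the output.
    """
    # Controllability proxy: empirical energy of each channel, i.e. how
    # strongly inputs excite it during the forward pass.
    ctrl = np.mean(hidden_states ** 2, axis=0)
    # Observability proxy: squared output-projection energy per channel,
    # i.e. how much each channel can influence the output.
    obs = np.sum(C_out ** 2, axis=0)
    # Joint importance ~ sqrt(ctrl * obs), by analogy with Hankel
    # singular values in classical balanced truncation.
    return np.sqrt(ctrl * obs)

def select_channels(hidden_states, C_out, keep_ratio=0.5):
    """Keep the top-scoring fraction of state channels (e.g. 50%)."""
    scores = ghost_like_scores(hidden_states, C_out)
    k = int(scores.shape[0] * keep_ratio)
    keep = np.argsort(scores)[::-1][:k]
    return np.sort(keep)  # sorted indices of channels to retain
```

A channel is kept only if it is both excited by inputs and visible at the output; a channel that is large but unread, or readable but never excited, scores low, which is the key difference from pure magnitude-based pruning.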