🤖 AI Summary
Mamba models underperform on long-sequence natural language tasks, so their inference-efficiency advantage fails to translate into end-task gains. To address this, we propose a two-stage re-forward framework that integrates selective state compression with intra-layer dynamic adaptation, easing Mamba's long-range dependency bottleneck without significant computational overhead. Our approach preserves the core structured state-space modeling paradigm while enhancing context awareness via a lightweight re-computation scheme. Evaluated on LongBench and L-Eval, it achieves absolute improvements of +3.2 and +1.6 points, respectively, approaching the performance of Transformer models with comparable parameter counts. Crucially, inference latency increases by less than 5%, marking one of the first instances of a Mamba model reaching near performance parity with Transformers on long-text understanding tasks.
📝 Abstract
While the Mamba architecture demonstrates superior inference efficiency and competitive performance on short-context natural language processing (NLP) tasks, empirical evidence suggests its capacity to comprehend long contexts is limited compared to transformer-based models. In this study, we investigate the long-context deficiencies of Mamba models and propose ReMamba, which enhances Mamba's ability to comprehend long contexts. ReMamba incorporates selective compression and adaptation techniques within a two-stage re-forward process, incurring minimal additional inference overhead. Experimental results on the LongBench and L-Eval benchmarks demonstrate ReMamba's efficacy, improving over the baselines by 3.2 and 1.6 points, respectively, and attaining performance almost on par with same-size transformer models.
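The abstract does not specify how the selective compression in the first forward pass works, so the following is only an illustrative sketch of the general idea: score each token's hidden state by its similarity to the sequence's final state, keep the top-scoring fraction, and feed that shortened sequence to the second forward pass. The scoring function, the `keep_ratio` parameter, and the function name are all assumptions for illustration, not ReMamba's actual method.

```python
import numpy as np

def selective_compress(hidden_states, final_state, keep_ratio=0.25):
    """Hypothetical stage-1 compression: rank token hidden states by
    cosine similarity to the final state and keep the top fraction,
    preserving their original order for the re-forward pass."""
    # hidden_states: (seq_len, d); final_state: (d,)
    norms = np.linalg.norm(hidden_states, axis=1) * np.linalg.norm(final_state)
    scores = hidden_states @ final_state / (norms + 1e-8)
    k = max(1, int(len(hidden_states) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])  # top-k indices, in sequence order
    return hidden_states[keep], keep

# Toy usage: compress a 16-token sequence of 8-dim states to 4 tokens.
rng = np.random.default_rng(0)
h = rng.standard_normal((16, 8))
compressed, idx = selective_compress(h, h[-1], keep_ratio=0.25)
```

Because the final state has cosine similarity 1 with itself, the last token is always retained in this toy scoring scheme; a real system would learn or tune the importance scores rather than use raw similarity.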