🤖 AI Summary
This work proposes 2Mamba, a novel approach that addresses the performance gap between linear attention mechanisms and standard Softmax attention. While linear attention offers computational efficiency, it often suffers from limited expressivity, leading to inferior accuracy. By simplifying the Mamba-2 architecture into Mamba-2S and incorporating an optimized A-mask design alongside higher-order hidden state modeling, 2Mamba significantly narrows this accuracy gap while preserving linear complexity. The method demonstrates superior memory efficiency over Softmax attention in long-context tasks and achieves competitive or even superior accuracy in several settings, effectively reconciling high performance with computational efficiency.
📝 Abstract
Linear attention transformers have become a strong alternative to softmax attention due to their efficiency. However, linear attention tends to be less expressive and yields reduced accuracy compared to softmax attention. To bridge this accuracy gap, we build on Mamba-2, a very strong linear attention variant. We first simplify Mamba-2 down to its most fundamental components, evaluating which specific design choices matter most for accuracy. Starting from this simplified variant (Mamba-2S), we improve the A-mask and increase the order of the hidden state, yielding a method we call 2Mamba that is nearly as accurate as softmax attention yet far more memory efficient at long context lengths. We also investigate extensions to Mamba-2 that help surpass the accuracy of softmax attention. Code is provided for all our experiments.
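To make the efficiency contrast the abstract refers to concrete, here is a minimal sketch (not the authors' released code): causal softmax attention materializes an L×L score matrix, while a Mamba-2-style decayed linear-attention recurrence carries only a fixed-size state. The scalar per-step decay `a_t` stands in for the A-mask; the improved A-mask and higher-order hidden state of 2Mamba are refinements of this basic form described in the paper.

```python
# Hedged illustration, assuming standard Mamba-2-style linear attention;
# not the 2Mamba implementation itself.
import numpy as np

def softmax_attention(Q, K, V):
    """Causal softmax attention: O(L^2) time and memory in the sequence length L."""
    L, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                              # (L, L) score matrix
    scores = np.where(np.tril(np.ones((L, L), dtype=bool)), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def decayed_linear_attention(Q, K, V, a):
    """Linear-attention recurrence with per-step decay a_t (Mamba-2-like).

    State update: H_t = a_t * H_{t-1} + K_t V_t^T,  output: y_t = Q_t H_t.
    Memory is O(d^2), independent of the sequence length.
    """
    L, d = Q.shape
    H = np.zeros((d, V.shape[1]))
    Y = np.zeros_like(V)
    for t in range(L):
        H = a[t] * H + np.outer(K[t], V[t])   # decayed state accumulation
        Y[t] = Q[t] @ H                       # read out with the query
    return Y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, d = 64, 16
    Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
    a = rng.uniform(0.9, 1.0, size=L)         # example decay values (A-mask stand-in)
    print(softmax_attention(Q, K, V).shape, decayed_linear_attention(Q, K, V, a).shape)
```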