🤖 AI Summary
Sequence models such as Mamba suffer from excessive attention to irrelevant context, resulting in noisy intermediate representations, weak long-range dependency modeling, poor retrieval capability, and reduced robustness. To address this, we propose the first differential mechanism tailored for selective state space models (SSMs), specifically designed for Mamba: a lightweight differential architecture incorporating gated residuals and context-aware gradient modulation to alleviate representation over-allocation. Unlike prior Transformer-based differential methods, our design overcomes architectural incompatibility by respecting SSM’s structural constraints. Evaluated on multiple language modeling benchmarks, the enhanced model significantly improves long-range dependency capture and information retrieval accuracy, reduces hallucination rates, and consistently outperforms the native Mamba. The implementation is publicly available.
📝 Abstract
Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations. This degrades LLM capabilities by promoting hallucinations, weakening long-range and retrieval abilities, and reducing robustness. Recent work has shown that differential design can mitigate this issue in Transformers, improving their effectiveness across various applications. In this paper, we explore whether these techniques, originally developed for Transformers, can be applied to Mamba, a recent architecture based on selective state-space layers that achieves Transformer-level performance with greater efficiency. We show that a naive adaptation of differential design to Mamba is insufficient and requires careful architectural modifications. To address this, we introduce a novel differential mechanism for Mamba, empirically validated on language modeling benchmarks, demonstrating improved retrieval capabilities and superior performance over vanilla Mamba. Finally, we conduct extensive ablation studies and empirical analyses to justify our design choices and provide evidence that our approach effectively mitigates the overallocation problem in Mamba-based models. Our code is publicly available.