🤖 AI Summary
This work uncovers a security vulnerability in state space models (SSMs) such as Mamba: specific short trigger phrases induce "partial amnesia" by irreversibly overwriting information in the model's hidden states. The authors call this a Hidden State Poisoning Attack (HiSPA) and introduce the RoBench25 benchmark to measure how much a model's information retrieval degrades under such attacks; they additionally evaluate on the existing Open-Prompt-Injections benchmark. The results indicate that SSM-based architectures are vulnerable to HiSPAs, whereas pure Transformers are not affected in the same way. A 52B hybrid SSM-Transformer model from the Jamba family collapses on RoBench25 under optimized HiSPA triggers and is significantly weakened on Open-Prompt-Injections. An interpretability study identifies patterns in Mamba's hidden layers during HiSPAs that could serve as the basis for a mitigation system.
📝 Abstract
State space models (SSMs) like Mamba offer efficient alternatives to Transformer-based language models, with time complexity linear in sequence length. Yet their adversarial robustness remains critically unexplored. This paper studies Hidden State Poisoning Attacks (HiSPAs), in which specific short input phrases induce a partial amnesia effect in such models by irreversibly overwriting information in their hidden states. Our benchmark RoBench25 evaluates a model's information retrieval capabilities when subjected to HiSPAs, and confirms the vulnerability of SSMs to such attacks. Even a recent 52B hybrid SSM-Transformer model from the Jamba family collapses on RoBench25 under optimized HiSPA triggers, whereas pure Transformers do not. We also observe that HiSPA triggers significantly weaken the Jamba model on the popular Open-Prompt-Injections benchmark, unlike pure Transformers. Finally, our interpretability study reveals patterns in Mamba's hidden layers during HiSPAs that could be used to build a HiSPA mitigation system. The full code and data to reproduce the experiments can be found at https://anonymous.4open.science/r/hispa_anonymous-5DB0.
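To make the "irreversible overwriting" intuition concrete, here is a minimal toy sketch in Python. It is not the paper's method and not Mamba's actual parameterization: it assumes a scalar recurrence h_t = a_t * h_{t-1} + (1 - a_t) * x_t with a hand-picked, input-dependent gate a_t, and the function names and gate values are hypothetical. It only illustrates why a single step with a gate near zero can erase everything stored earlier in a recurrent state.

```python
def run_toy_ssm(inputs, gates, h0=0.0):
    """Run the toy recurrence h_t = a_t * h_{t-1} + (1 - a_t) * x_t."""
    h = h0
    for x, a in zip(inputs, gates):
        h = a * h + (1.0 - a) * x
    return h


def first_token_weight(gates):
    """Weight of the first input in the final state: (1 - a_1) * prod_{t>1} a_t."""
    weight = 1.0 - gates[0]
    for a in gates[1:]:
        weight *= a
    return weight


# Hypothetical gate values: 0.9 means "mostly keep the state",
# 0.0 stands in for a trigger-like token that fully overwrites it.
benign_gates = [0.0, 0.9, 0.9, 0.9]
poisoned_gates = [0.0, 0.9, 0.0, 0.9]

# The first input plays the role of the fact to be retrieved later.
state_a = run_toy_ssm([5.0, 1.0, 1.0, 1.0], poisoned_gates)
state_b = run_toy_ssm([9.0, 1.0, 1.0, 1.0], poisoned_gates)

print(first_token_weight(benign_gates))    # the early token still contributes
print(first_token_weight(poisoned_gates))  # the contribution is wiped out
print(state_a == state_b)                  # final state no longer depends on it
```

In this toy setting, once the overwriting step has happened, no later input can recover the earlier token, because the state is the only memory the recurrence has. An attention-based model can in principle still attend to the original token, which is one plausible reading of why the abstract reports different behavior for pure Transformers.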