🤖 AI Summary
Traditional adversarial attacks target only single-step decisions, making them ineffective at inducing cascading failures across multi-step reasoning chains. This work introduces the novel threat of “decision chain hijacking”: a single perturbation that simultaneously manipulates multiple downstream outputs of multimodal large language models (MLLMs)—e.g., misclassifying “bicycle lane” as “motor vehicle lane” while misidentifying “pedestrian” as “plastic bag.” To realize this threat, we propose Semantic-Aware Universal Perturbations (SAUPs), which integrate semantic guidance, normalized spatial search, and target-decoupling optimization to enable synchronized control over five distinct output categories via a single-frame perturbation. Evaluated on our newly constructed real-world dataset RIST, SAUPs achieve an average attack success rate of 70% across three state-of-the-art MLLMs. Our results expose a previously unrecognized systemic security vulnerability in MLLMs—namely, their susceptibility to targeted corruption of extended, interdependent reasoning chains.
📝 Abstract
Conventional adversarial attacks focus on manipulating a single decision of a neural network. However, real-world models often make a sequence of decisions, where an isolated mistake can be easily corrected but cascading errors can lead to severe risks.
This paper reveals a novel threat: a single perturbation can hijack the whole decision chain. We demonstrate the feasibility of manipulating a model's outputs toward multiple, predefined outcomes, such as simultaneously misclassifying "non-motorized lane" signs as "motorized lane" and "pedestrian" as "plastic bag".
To expose this threat, we introduce Semantic-Aware Universal Perturbations (SAUPs), which induce varied outcomes based on the semantics of the inputs. We overcome the resulting optimization challenges with an effective algorithm that searches for perturbations in a normalized space using a semantic separation strategy. To evaluate the practical threat of SAUPs, we present RIST, a new real-world image dataset with fine-grained semantic annotations. Extensive experiments on three multimodal large language models demonstrate this vulnerability, achieving a 70% attack success rate when controlling five distinct targets with a single adversarial frame.
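The core optimization idea described above — one shared perturbation, a *different* adversarial target per semantic group of inputs, searched inside a norm-bounded space — can be illustrated with a toy sketch. This is not the paper's algorithm: the linear "model" `W`, the two groups, and all numbers below are illustrative assumptions standing in for an MLLM and the RIST semantic categories.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a frozen model's decision head: a linear
# classifier over 8 features with 4 output classes.
W = rng.normal(size=(4, 8))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def total_loss(delta, groups):
    """Summed cross-entropy pushing each group toward its own target."""
    total = 0.0
    for xs, target in groups.values():
        for x in xs:
            total -= np.log(softmax(W @ (x + delta))[target])
    return total

# Two semantic groups, each mapped to a *different* adversarial target:
# this is the "semantic separation" idea — one perturbation, per-group goals.
groups = {
    "lane_sign":  (rng.normal(size=(5, 8)), 2),  # e.g. drive toward class 2
    "pedestrian": (rng.normal(size=(5, 8)), 3),  # e.g. drive toward class 3
}

eps = 2.0                      # perturbation budget (illustrative)
delta = np.zeros(8)            # the single universal perturbation
history = [total_loss(delta, groups)]

for _ in range(300):
    grad = np.zeros(8)
    n_samples = 0
    for xs, target in groups.values():
        for x in xs:
            p = softmax(W @ (x + delta))
            p[target] -= 1.0           # d(cross-entropy)/d(logits)
            grad += W.T @ p
            n_samples += 1
    delta -= 0.01 * grad / n_samples
    # Search in a normalized space: project back onto the eps-ball.
    norm = np.linalg.norm(delta)
    if norm > eps:
        delta *= eps / norm
    history.append(total_loss(delta, groups))
```

After the loop, `delta` stays within the `eps` budget while the combined per-group targeted loss has decreased. The actual SAUP attack optimizes in the pixel space of an adversarial frame against MLLM outputs, but the structure — shared perturbation, semantically separated targets, norm-constrained search — is the same.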