🤖 AI Summary
This work addresses the challenge of performing universal, goal-directed adversarial attacks against closed-source multimodal large language models (MLLMs) in black-box settings. To this end, the authors propose MCRMO-Attack, a novel method that integrates attention-guided multi-crop aggregation, alignability-gated token routing, and a meta-learning optimization framework that learns cross-target perturbation priors. This approach enables the first universal targeted attack applicable to arbitrary inputs without requiring model access or task-specific tuning. Experimental results demonstrate that MCRMO-Attack significantly outperforms existing universal attack baselines, improving the targeted attack success rate on unseen images by 23.7% on GPT-4o and by 19.9% on Gemini-2.0.
📝 Abstract
Targeted adversarial attacks on closed-source multimodal large language models (MLLMs) have been increasingly explored under black-box transfer, yet prior methods are predominantly sample-specific and offer limited reusability across inputs. We instead study a more stringent setting, Universal Targeted Transferable Adversarial Attacks (UTTAA), where a single perturbation must consistently steer arbitrary inputs toward a specified target across unknown commercial MLLMs. Naively adapting existing sample-wise attacks to this universal setting faces three core difficulties: (i) target supervision becomes high-variance due to target-crop randomness; (ii) token-wise matching is unreliable because universality suppresses the image-specific cues that would otherwise anchor alignment; and (iii) few-source per-target adaptation is highly sensitive to initialization, which degrades attainable performance. In this work, we propose MCRMO-Attack, which stabilizes supervision via Multi-Crop Aggregation with an Attention-Guided Crop, improves token-level reliability through alignability-gated Token Routing, and meta-learns a cross-target perturbation prior that yields stronger per-target solutions. Across commercial MLLMs, we improve the unseen-image attack success rate by 23.7% on GPT-4o and 19.9% on Gemini-2.0 over the strongest universal baseline.
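The optimization structure described in the abstract — multi-crop-aggregated supervision, per-target adaptation of a universal perturbation, and a meta-learned cross-target prior — can be illustrated with a deliberately toy sketch. Everything below is an assumption for illustration only: a random linear map stands in for the surrogate encoder, 1-D arrays stand in for images, and a Reptile-style pull toward adapted solutions stands in for the meta-learner. The paper's attention-guided cropping and alignability-gated token routing are not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C, E_DIM = 64, 16, 8                  # toy image length, crop length, embedding dim
W = rng.normal(size=(E_DIM, C)) * 0.1    # stand-in linear "surrogate encoder" (assumption)
images = rng.normal(size=(5, D))         # stand-in source images
targets = rng.normal(size=(3, E_DIM))    # stand-in target embeddings

def crop_loss_grad(x, delta, t, start):
    """Squared embedding distance of one crop to the target, and its gradient w.r.t. delta."""
    c = (x + delta)[start:start + C]
    r = W @ c - t
    g = np.zeros(D)
    g[start:start + C] = 2.0 * W.T @ r
    return float(r @ r), g

def adapt(delta0, t, steps=50, lr=0.05, n_crops=4):
    """Per-target adaptation: average the gradient over several random crops per image,
    a crude stand-in for multi-crop aggregation that reduces crop-induced variance."""
    delta = delta0.copy()
    for _ in range(steps):
        g = np.zeros(D)
        for x in images:
            for _ in range(n_crops):
                s = int(rng.integers(0, D - C + 1))
                g += crop_loss_grad(x, delta, t, s)[1]
        delta -= lr * g / (len(images) * n_crops)
        delta = np.clip(delta, -0.5, 0.5)        # L_inf perturbation budget
    return delta

def mean_loss(delta, t):
    """Deterministic evaluation: average loss over every crop of every image."""
    return float(np.mean([crop_loss_grad(x, delta, t, s)[0]
                          for x in images for s in range(D - C + 1)]))

# Reptile-style meta-update of a cross-target prior: after adapting to each target,
# pull the prior toward the adapted perturbation so later adaptations start warmer.
prior = np.zeros(D)
for _ in range(3):
    for t in targets:
        prior += 0.5 * (adapt(prior, t, steps=10) - prior)

t = targets[0]
before = mean_loss(np.zeros(D), t)
after = mean_loss(adapt(prior, t), t)
print(f"mean target loss: {before:.3f} -> {after:.3f}")
```

The sketch only shows the shape of the optimization (a single shared `delta` driven toward a fixed target embedding across inputs and crops, warm-started from a meta-learned prior); it makes no claim about the paper's actual losses, encoders, or hyperparameters.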