AI Summary
This work addresses the vulnerability of multimodal large language models (MLLMs) in safety-critical scenarios to cross-modal typographic attacks, a threat largely overlooked by existing research, which predominantly focuses on unimodal perturbations and lacks systematic analysis of multimodal synergistic effects. The study introduces and empirically validates a novel cross-modal typographic attack framework that leverages coordinated perturbation generation and composition strategies across audio, visual, and textual modalities. Comprehensive evaluations on state-of-the-art MLLMs and benchmarks spanning commonsense reasoning and content moderation demonstrate that the proposed cross-modal attack achieves a success rate of 83.43%, substantially outperforming unimodal attacks (34.93%). These findings reveal critical fragilities in MLLMs under joint multimodal perturbations and establish a new direction for research in multimodal system security.
Abstract
As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we introduce Multi-Modal Typography, a systematic study examining how typographic attacks across multiple modalities adversely influence MLLMs. While prior work focuses narrowly on unimodal attacks, we expose the cross-modal fragility of MLLMs. We analyze the interactions between audio, visual, and text perturbations and reveal that a coordinated multi-modal attack creates a significantly more potent threat than single-modality attacks (attack success rate = $83.43\%$ vs $34.93\%$). Our findings across multiple frontier MLLMs, tasks, and benchmarks spanning common-sense reasoning and content moderation establish multi-modal typography as a critical and underexplored attack strategy in multi-modal reasoning. Code and data will be publicly available.