Modal Aphasia: Can Unified Multimodal Models Describe Images From Memory?

📅 2025-10-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper identifies and names “modal aphasia”: a systematic cross-modal dissociation in unified multimodal models, which can faithfully reconstruct memorized images (e.g., iconic movie artwork) yet fail to accurately describe the same visual concepts in text. The authors argue the phenomenon is not merely a training artifact but a fundamental property of current unified multimodal models. To characterize it rigorously, they construct controllable synthetic datasets and run controlled experiments across multiple state-of-the-art architectures, showing that the dissociation emerges reliably. Crucially, they demonstrate that safety alignment applied only to the text modality does not prevent harmful image generation, exposing a cross-modal safety vulnerability. The core contributions are: (1) formalizing modal aphasia as a conceptual framework; (2) establishing through controlled experiments that it is an inherent property of current unified models; and (3) demonstrating the limitations of single-modality safety mechanisms, motivating cross-modal evaluation for trustworthy multimodal AI.

📝 Abstract
We present modal aphasia, a systematic dissociation in which current unified multimodal models accurately memorize concepts visually but fail to articulate them in writing, despite being trained on images and text simultaneously. For one, we show that leading frontier models can generate near-perfect reproductions of iconic movie artwork, but confuse crucial details when asked for textual descriptions. We corroborate those findings through controlled experiments on synthetic datasets in multiple architectures. Our experiments confirm that modal aphasia reliably emerges as a fundamental property of current unified multimodal models, not just as a training artifact. In practice, modal aphasia can introduce vulnerabilities in AI safety frameworks, as safeguards applied to one modality may leave harmful concepts accessible in other modalities. We demonstrate this risk by showing how a model aligned solely on text remains capable of generating unsafe images.
Problem

Research questions and friction points this paper is trying to address.

Unified multimodal models fail to articulate visually memorized concepts in writing
Models generate accurate images but confuse details in textual descriptions
Safeguards applied to one modality leave harmful concepts accessible in others
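The friction points above imply a simple evaluation pattern: probe the same memorized concept through both output modalities and score each independently, so the visual/textual gap becomes measurable. A minimal sketch of that idea, assuming a hypothetical model API (`generate_image`, `describe`) and scorer functions that are illustrative, not taken from the paper:

```python
def modal_aphasia_gap(model, concepts, image_sim, text_acc):
    """Probe each concept through both modalities and return, per concept,
    the gap between visual reconstruction quality and textual recall.

    model     -- hypothetical unified multimodal model exposing
                 .generate_image(prompt) and .describe(prompt)
                 (an illustrative API, not the paper's)
    concepts  -- mapping of concept name -> reference material
    image_sim -- scores the generated image against the reference, in [0, 1]
    text_acc  -- scores the textual description against the reference, in [0, 1]
    """
    gaps = {}
    for name, reference in concepts.items():
        img = model.generate_image(prompt=f"the artwork '{name}'")
        txt = model.describe(prompt=f"Describe the artwork '{name}' in detail.")
        # Modal aphasia predicts high visual fidelity but low textual
        # accuracy for memorized concepts, i.e. a large positive gap.
        gaps[name] = image_sim(img, reference) - text_acc(txt, reference)
    return gaps
```

A safety variant of the same loop would check whether a text-aligned model refuses a harmful textual request yet still produces the corresponding image, which is the vulnerability the paper demonstrates.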
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formalization of modal aphasia as a systematic cross-modal dissociation in unified multimodal models
Controlled experiments on synthetic datasets across multiple architectures, showing the effect is a fundamental property rather than a training artifact
Demonstration that text-only safety alignment leaves a model capable of generating unsafe images