🤖 AI Summary
This work addresses the challenge of enabling vision-language models (VLMs) to comprehend audiovisual affective expressions in artworks. We propose a lightweight audio-pretraining paradigm for cross-modal emotion understanding. Our two-stage framework, VAEmotionLLM, first aligns audio and visual representations on synchronized clips via a vision-guided audio alignment mechanism that uses the visual modality as the anchor; it then adds a lightweight cross-modal emotion adapter that combines knowledge distillation, emotion-enhanced residual injection, and explicit emotion supervision to achieve fine-grained audiovisual affective semantic alignment. Crucially, our method activates the VLM's "auditory understanding" capability with only minimal audio data. Evaluated on the newly constructed ArtEmoBenchmark, a dedicated artistic-emotion benchmark, our approach significantly outperforms unimodal and state-of-the-art multimodal baselines. To our knowledge, this is the first work to empirically validate the efficacy and complementarity of vision-guided, controllable auditory perception for artistic emotion understanding.
📝 Abstract
Emotion understanding is critical for making Large Language Models (LLMs) more general, reliable, and aligned with humans. Art conveys emotion through the joint design of visual and auditory elements, yet most prior work is human-centered or single-modality, overlooking the emotion intentionally expressed by the artwork. Meanwhile, current Audio-Visual Language Models (AVLMs) typically require large-scale audio pretraining to endow Visual Language Models (VLMs) with hearing, which limits scalability. We present Vision Anchored Audio-Visual Emotion LLM (VAEmotionLLM), a two-stage framework that teaches a VLM to hear by seeing with limited audio pretraining and to understand emotion across modalities. In Stage 1, Vision-Guided Audio Alignment (VG-Align) distills the frozen visual pathway into a new audio pathway by aligning next-token distributions of the shared LLM on synchronized audio-video clips, enabling hearing without a large audio dataset. In Stage 2, a lightweight Cross-Modal Emotion Adapter (EmoAdapter), composed of the Emotion Enhancer and the Emotion Supervisor, injects emotion-sensitive residuals and applies emotion supervision to enhance cross-modal emotion understanding. We also construct ArtEmoBenchmark, an art-centric emotion benchmark that evaluates content and emotion understanding under audio-only, visual-only, and audio-visual inputs. VAEmotionLLM achieves state-of-the-art results on ArtEmoBenchmark, outperforming audio-only, visual-only, and audio-visual baselines. Ablations show that the proposed components are complementary.
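The abstract describes Stage 1 (VG-Align) as distilling the frozen visual pathway into a new audio pathway by matching the shared LLM's next-token distributions, and Stage 2 (EmoAdapter) as injecting emotion-sensitive residuals. A minimal NumPy sketch of what such a distribution-matching loss and residual adapter could look like; the function names, the KL direction, and the bottleneck-MLP adapter shape are our assumptions for illustration, not details confirmed by the paper:

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def vg_align_loss(teacher_logits, student_logits):
    """KL(teacher || student) averaged over batch and positions.

    teacher_logits: next-token logits from the frozen visual pathway
    student_logits: next-token logits from the new audio pathway,
    both of shape (batch, seq_len, vocab) on synchronized clips.
    (The exact divergence used by the paper is an assumption here.)
    """
    p = softmax(teacher_logits)
    log_p = np.log(p + 1e-12)
    log_q = np.log(softmax(student_logits) + 1e-12)
    return float(np.mean((p * (log_p - log_q)).sum(axis=-1)))

def emo_adapter(h, w_down, w_up, scale=0.1):
    """Emotion-enhanced residual injection (hypothetical bottleneck MLP):
    h + scale * up(relu(down(h))), leaving the backbone features intact
    when scale is small."""
    z = np.maximum(h @ w_down, 0.0)
    return h + scale * (z @ w_up)

# Toy usage with random tensors standing in for model activations.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(2, 4, 8))
student = rng.normal(size=(2, 4, 8))
print("VG-Align loss:", vg_align_loss(teacher, student))

hidden = rng.normal(size=(2, 4, 16))
w_down = rng.normal(size=(16, 4))   # bottleneck down-projection
w_up = rng.normal(size=(4, 16))     # up-projection back to hidden size
print("adapted shape:", emo_adapter(hidden, w_down, w_up).shape)
```

The residual form means the adapter can be trained on top of a frozen backbone: with `scale=0` (or zero-initialized `w_up`) it is the identity, which matches the lightweight, minimally invasive role the abstract assigns to EmoAdapter.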