🤖 AI Summary
This work addresses the challenge of enabling vision-language models (VLMs) to comprehend audiovisual affective expressions in artworks. We propose a lightweight audio-pretraining paradigm for cross-modal emotion understanding. Our two-stage framework, VAEmotionLLM, first aligns audio and visual representations on synchronized clips via a vision-guided audio alignment mechanism that uses the visual modality as the anchor; it then adds a lightweight cross-modal emotion adapter that combines knowledge distillation, emotion-enhanced residual injection, and explicit emotion supervision to achieve fine-grained audiovisual affective semantic alignment. Crucially, our method activates the VLM's "auditory understanding" capability with only minimal audio data. Evaluated on the newly constructed ArtEmoBenchmark, a dedicated artistic-emotion benchmark, our approach significantly outperforms unimodal and state-of-the-art multimodal baselines. To our knowledge, this is the first work to empirically validate the efficacy and complementarity of vision-guided, controllable auditory perception for artistic emotion understanding.
📝 Abstract
Emotion understanding is critical for making Large Language Models (LLMs) more general, reliable, and aligned with humans. Art conveys emotion through the joint design of visual and auditory elements, yet most prior work is human-centered or single-modality, overlooking the emotion intentionally expressed by the artwork. Meanwhile, current Audio-Visual Language Models (AVLMs) typically require large-scale audio pretraining to endow Visual Language Models (VLMs) with hearing, which limits scalability. We present Vision Anchored Audio-Visual Emotion LLM (VAEmotionLLM), a two-stage framework that teaches a VLM to hear by seeing with limited audio pretraining and to understand emotion across modalities. In Stage 1, Vision-Guided Audio Alignment (VG-Align) distills the frozen visual pathway into a new audio pathway by aligning next-token distributions of the shared LLM on synchronized audio-video clips, enabling hearing without a large audio dataset. In Stage 2, a lightweight Cross-Modal Emotion Adapter (EmoAdapter), composed of the Emotion Enhancer and the Emotion Supervisor, injects emotion-sensitive residuals and applies emotion supervision to enhance cross-modal emotion understanding. We also construct ArtEmoBenchmark, an art-centric emotion benchmark that evaluates content and emotion understanding under audio-only, visual-only, and audio-visual inputs. VAEmotionLLM achieves state-of-the-art results on ArtEmoBenchmark, outperforming audio-only, visual-only, and audio-visual baselines. Ablations show that the proposed components are complementary.
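The abstract describes Stage 1 (VG-Align) as distilling the frozen visual pathway into a new audio pathway by matching the shared LLM's next-token distributions, and Stage 2 (EmoAdapter) as injecting emotion-sensitive residuals. A minimal NumPy sketch of what such a distribution-matching loss and residual adapter could look like; the function names, the KL direction, and the bottleneck-MLP adapter shape are our assumptions for illustration, not details confirmed by the paper:

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def vg_align_loss(teacher_logits, student_logits):
    """KL(teacher || student) averaged over batch and positions.

    teacher_logits: next-token logits from the frozen visual pathway
    student_logits: next-token logits from the new audio pathway,
    both of shape (batch, seq_len, vocab) on synchronized clips.
    (The exact divergence used by the paper is an assumption here.)
    """
    p = softmax(teacher_logits)
    log_p = np.log(p + 1e-12)
    log_q = np.log(softmax(student_logits) + 1e-12)
    return float(np.mean((p * (log_p - log_q)).sum(axis=-1)))

def emo_adapter(h, w_down, w_up, scale=0.1):
    """Emotion-enhanced residual injection (hypothetical bottleneck MLP):
    h + scale * up(relu(down(h))), leaving the backbone features intact
    when scale is small."""
    z = np.maximum(h @ w_down, 0.0)
    return h + scale * (z @ w_up)

# Toy usage with random tensors standing in for model activations.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(2, 4, 8))
student = rng.normal(size=(2, 4, 8))
print("VG-Align loss:", vg_align_loss(teacher, student))

hidden = rng.normal(size=(2, 4, 16))
w_down = rng.normal(size=(16, 4))   # bottleneck down-projection
w_up = rng.normal(size=(4, 16))     # up-projection back to hidden size
print("adapted shape:", emo_adapter(hidden, w_down, w_up).shape)
```

The residual form means the adapter can be trained on top of a frozen backbone: with `scale=0` (or zero-initialized `w_up`) it is the identity, which matches the lightweight, minimally invasive role the abstract assigns to EmoAdapter.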