🤖 AI Summary
Visual sentiment analysis faces two challenges: the limited representational capacity of unimodal models and the high computational overhead of dual-model inference. To address both, this paper proposes a lightweight collaborative framework: it freezes the text encoder of a pre-trained vision-language model (VLM) and transfers emotion-discrimination capability from a CNN- or Transformer-based vision model to the VLM’s visual encoder via knowledge distillation. A learnable gating module then dynamically fuses predictions from the frozen VLM and the distilled visual pathway. By eliminating parallel dual-model inference, the method significantly reduces computational cost. Extensive experiments show state-of-the-art performance across multiple mainstream visual sentiment benchmarks, with average improvements of 2.3–5.1 percentage points over existing approaches. The source code is publicly available.
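The distillation step described above can be sketched as a standard soft-target knowledge-distillation loss: a frozen CNN/Transformer "teacher" supplies softened emotion distributions, and a lightweight "student" module on the VLM's visual encoder is trained to match them. The temperature `T`, the KL-divergence form, and the function name `kd_loss` follow the common Hinton-style KD convention and are illustrative assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            T: float = 2.0) -> torch.Tensor:
    """Soft-target distillation loss (assumed standard KD form).

    KL divergence between temperature-softened teacher and student
    distributions, scaled by T^2 to keep gradient magnitudes
    comparable across temperatures.
    """
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

# Toy usage: a batch of 4 images over 8 emotion classes.
student = torch.randn(4, 8, requires_grad=True)  # distilled module's logits
teacher = torch.randn(4, 8)                      # frozen vision model's logits
loss = kd_loss(student, teacher)
loss.backward()  # gradients flow only into the student module
```

In the paper's setting, only the attached module is trainable here; the teacher vision model and the VLM itself stay frozen.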
📝 Abstract
Visual emotion analysis, which has gained considerable attention in the field of affective computing, aims to predict the dominant emotions conveyed by an image. Despite advancements in visual emotion analysis with the emergence of vision-language models, we observed that instruction-tuned vision-language models and conventional vision models exhibit complementary strengths in visual emotion analysis: vision-language models excel in certain cases, whereas vision models perform better in others. This finding highlights the need to integrate these capabilities to enhance the performance of visual emotion analysis. To bridge this gap, we propose EmoVLM-KD, an instruction-tuned vision-language model augmented with a lightweight module distilled from conventional vision models. Instead of deploying both models simultaneously, which incurs high computational costs, we transfer the predictive patterns of a conventional vision model into the vision-language model using a knowledge distillation framework. Our approach first fine-tunes a vision-language model on emotion-specific instruction data and then attaches a distilled module to its visual encoder while keeping the vision-language model frozen. Predictions from the vision-language model and the distillation module are effectively balanced by a gate module, which generates the final outcome. Extensive experiments show that EmoVLM-KD achieves state-of-the-art performance on multiple visual emotion analysis benchmark datasets, outperforming existing methods while maintaining computational efficiency. The code is available at https://github.com/sange1104/EmoVLM-KD.
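The gate module described in the abstract can be sketched as a small learnable network that looks at both sets of emotion logits and outputs a mixing weight. This is a minimal illustration of the idea only; the class name `GateFusion`, the gate's architecture, and the use of a scalar convex combination are assumptions, not the EmoVLM-KD implementation.

```python
import torch
import torch.nn as nn

class GateFusion(nn.Module):
    """Hypothetical gate balancing the two prediction pathways:
    logits from the frozen VLM and logits from the distilled module."""

    def __init__(self, num_emotions: int):
        super().__init__()
        # The gate maps the concatenated logits to a scalar weight in (0, 1).
        self.gate = nn.Sequential(
            nn.Linear(2 * num_emotions, num_emotions),
            nn.ReLU(),
            nn.Linear(num_emotions, 1),
            nn.Sigmoid(),
        )

    def forward(self, vlm_logits: torch.Tensor,
                kd_logits: torch.Tensor) -> torch.Tensor:
        alpha = self.gate(torch.cat([vlm_logits, kd_logits], dim=-1))
        # Convex combination: alpha weights the VLM pathway,
        # (1 - alpha) weights the distilled pathway.
        return alpha * vlm_logits + (1 - alpha) * kd_logits

# Toy usage: batch of 4 images, 8 emotion classes.
fusion = GateFusion(num_emotions=8)
vlm_logits = torch.randn(4, 8)  # from the frozen, instruction-tuned VLM
kd_logits = torch.randn(4, 8)   # from the distilled module on the visual encoder
fused = fusion(vlm_logits, kd_logits)  # shape (4, 8); argmax gives the emotion
```

Because the gate is learned, it can lean on whichever pathway is more reliable for a given image, matching the complementary-strengths observation that motivates the paper.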