🤖 AI Summary
Current visual emotion analysis methods suffer from limited generalization due to emotion ambiguity and scene diversity. To address this, we propose a general-scenario emotion-aware cross-modal pretraining framework. Methodologically, we are the first to systematically integrate psychological emotion theory—specifically, environment–individual interaction—into contrastive learning and masked image modeling; design a semantic-guided cross-modal knowledge distillation mechanism; and construct Emo8, a large-scale dataset covering eight universal emotions across diverse visual styles. Our model employs dual-path encoding (scene and person), CLIP-based semantic distillation, and cross-modal alignment. Evaluated on six benchmarks across two downstream task categories (emotion classification and regression), it significantly outperforms state-of-the-art methods and demonstrates strong cross-domain generalization—robustly interpreting emotional expressions in cartoons, sci-fi imagery, advertisements, and other heterogeneous visual domains.
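The CLIP-based semantic distillation mentioned above can be sketched as a toy example. Everything below is an illustrative assumption, not the paper's actual implementation: the function name `distill_loss`, the embedding dimension, and the choice of matching the student's image-text similarity distribution to a frozen CLIP teacher's via a soft cross-entropy are all stand-ins for the framework's real mechanism.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # unit-normalize embeddings so dot products become cosine similarities
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def distill_loss(student_img_emb, teacher_img_emb, teacher_txt_emb, tau=0.07):
    """Toy semantic-distillation loss: push the student's image-text
    similarity distribution toward the frozen teacher's (assumed form)."""
    s = l2_normalize(student_img_emb) @ l2_normalize(teacher_txt_emb).T / tau
    t = l2_normalize(teacher_img_emb) @ l2_normalize(teacher_txt_emb).T / tau
    # teacher rows -> soft targets (softmax)
    t_prob = np.exp(t - t.max(axis=1, keepdims=True))
    t_prob /= t_prob.sum(axis=1, keepdims=True)
    # student rows -> log-softmax
    log_s = s - s.max(axis=1, keepdims=True)
    log_s -= np.log(np.exp(log_s).sum(axis=1, keepdims=True))
    # soft cross-entropy, averaged over the batch
    return float(-(t_prob * log_s).sum(axis=1).mean())

rng = np.random.default_rng(0)
img_s = rng.normal(size=(4, 512))   # student image embeddings
img_t = rng.normal(size=(4, 512))   # frozen CLIP image embeddings
txt_t = rng.normal(size=(4, 512))   # frozen CLIP text embeddings
loss = distill_loss(img_s, img_t, txt_t)
```

When the student's embeddings coincide with the teacher's, the loss reduces to the entropy of the teacher's similarity distribution, its minimum; any mismatch adds a KL term on top, which is what drives the student toward the teacher's semantic space.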
📝 Abstract
Visual emotion analysis holds significant research value in both computer vision and psychology. However, existing methods for visual emotion analysis suffer from limited generalizability due to the ambiguity of emotion perception and the diversity of data scenarios. To tackle this issue, we introduce UniEmoX, a cross-modal semantic-guided large-scale pretraining framework. Inspired by psychological research emphasizing the inseparability of the emotional exploration process from the interaction between individuals and their environment, UniEmoX integrates scene-centric and person-centric low-level image spatial structural information, aiming to derive more nuanced and discriminative emotional representations. By exploiting the similarity between paired and unpaired image-text samples, UniEmoX distills rich semantic knowledge from the CLIP model to enhance emotional embedding representations more effectively. To the best of our knowledge, this is the first large-scale pretraining framework that integrates psychological theories with contemporary contrastive learning and masked image modeling techniques for emotion analysis across diverse scenarios. Additionally, we develop a visual emotion dataset titled Emo8. Emo8 samples span a range of domains, including cartoon, natural, realistic, science-fiction, and advertising-cover styles, covering nearly all common emotional scenes. Comprehensive experiments conducted on six benchmark datasets across two downstream tasks validate the effectiveness of UniEmoX. The source code is available at https://github.com/chincharles/u-emo.
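The scene-centric/person-centric dual-path idea can be illustrated with a minimal sketch. Note that everything here is assumed for illustration: the mean-pool-then-project "encoders", the patch counts, and fusion by simple concatenation are placeholders, since the abstract does not specify the actual fusion mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)

def encode(patch_grid, proj):
    """Stand-in encoder: mean-pool patch features, then linearly project.
    (A real implementation would use a learned vision backbone.)"""
    return patch_grid.mean(axis=0) @ proj

d_patch, d_emb = 768, 256  # assumed feature/embedding sizes
proj_scene = rng.normal(size=(d_patch, d_emb)) / np.sqrt(d_patch)
proj_person = rng.normal(size=(d_patch, d_emb)) / np.sqrt(d_patch)

scene_patches = rng.normal(size=(196, d_patch))   # whole-image patches
person_patches = rng.normal(size=(49, d_patch))   # person-crop patches

# dual-path encoding: separate paths for scene and person context,
# fused here by concatenation into one joint representation
z = np.concatenate([encode(scene_patches, proj_scene),
                    encode(person_patches, proj_person)])
```

The point of the two paths is that the emotional cue often lives in the interaction between the person and their surroundings, so neither a global scene feature nor a person crop alone suffices.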