🤖 AI Summary
Visual Sentiment Analysis (VSA) suffers from insufficient data diversity and poor cross-dataset generalization of models. To address this, we introduce semiotic isotopy—a foundational concept from semiotics—into VSA for the first time, proposing a semantics-driven methodology for constructing large-scale sentiment image datasets. Our approach identifies recurrent semantic combinations of sentiment-salient visual elements via semiotic analysis, then integrates image semantic parsing with structure-aware data augmentation to generate high-quality samples preserving emotional structural consistency. This paradigm shifts beyond conventional statistical data expansion by explicitly modeling the intrinsic logic of emotional expression and enabling controllable, semantics-guided dataset scaling. Evaluated on mainstream VSA benchmarks, models trained on our new dataset achieve substantial improvements in cross-domain generalization, with average accuracy gains of 5.2–8.7% over baselines trained on original datasets, alongside enhanced robustness and stability.
📝 Abstract
Visual Sentiment Analysis (VSA) is a challenging task due to the vast diversity of emotionally salient images and the inherent difficulty of acquiring sufficient data to capture this variability comprehensively. Key obstacles include building large-scale VSA datasets and developing effective methodologies that enable algorithms to identify emotionally significant elements within an image. These challenges are reflected in the limited generalization performance of VSA algorithms and models when trained and tested across different datasets. Starting from a pool of existing data collections, our approach enables the creation of a new larger dataset that not only contains a wider variety of images than the original ones, but also permits training new models with improved capability to focus on emotionally relevant combinations of image elements. This is achieved through the integration of the semiotic isotopy concept within the dataset creation process, providing deeper insights into the emotional content of images. Empirical evaluations show that models trained on a dataset generated with our method consistently outperform those trained on the original data collections, achieving superior generalization across major VSA benchmarks