🤖 AI Summary
Assessing the semantic similarity of stickers is difficult because sticker content is highly diverse and heavily symbolic, and because the field lacks both standardized benchmarks and specialized models. This paper formally defines the sticker semantic similarity task for the first time, introduces Triple-S, the first high-quality, human-annotated triplet benchmark for the task, and proposes the lightweight General Sticker Encoder (GSE). GSE is a Transformer-based architecture trained via multi-stage contrastive learning on Triple-S and additional sticker datasets, which lets it model symbolic semantics robustly. Experiments show that GSE produces significantly better semantic embeddings for unseen stickers than general-purpose vision models and achieves state-of-the-art performance on downstream tasks such as emotion classification and cross-domain retrieval. With few parameters and efficient inference, GSE serves as a deployable, generalizable foundation model for sticker understanding, and Triple-S provides a rigorous evaluation paradigm alongside it.
📝 Abstract
Stickers have become a popular form of visual communication, yet understanding their semantic relationships remains challenging due to their highly diverse and symbolic content. In this work, we formally define the Sticker Semantic Similarity task and introduce Triple-S, the first benchmark for this task, consisting of 905 human-annotated positive and negative sticker pairs. Through extensive evaluation, we show that existing pretrained vision and multimodal models struggle to capture nuanced sticker semantics. To address this, we propose the General Sticker Encoder (GSE), a lightweight and versatile model that learns robust sticker embeddings using both Triple-S and additional datasets. GSE achieves superior performance on unseen stickers and demonstrates strong results on downstream tasks such as emotion classification and sticker-to-sticker retrieval. By releasing both Triple-S and GSE, we provide standardized evaluation tools and robust embeddings, enabling future research in sticker understanding, retrieval, and multimodal content generation. The Triple-S benchmark and GSE have been publicly released.
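To make the pair-based evaluation concrete, the following is a minimal sketch of how an embedding model could be scored on labeled positive and negative sticker pairs. It is not the paper's evaluation code: the cosine-similarity decision rule, the 0.5 threshold, the function names `cosine_similarity` and `pair_accuracy`, and the toy two-dimensional embeddings are all assumptions made for illustration, and the metric actually used for Triple-S is not stated in this abstract.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def pair_accuracy(pairs, threshold=0.5):
    """Fraction of labeled pairs classified correctly.

    `pairs` holds (embedding_a, embedding_b, label) tuples, where label 1
    marks a semantically similar pair and 0 a dissimilar one. The threshold
    is a hypothetical choice, not a value taken from the paper.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one labeled pair is required")
    correct = 0
    for emb_a, emb_b, label in pairs:
        predicted = 1 if cosine_similarity(emb_a, emb_b) >= threshold else 0
        correct += int(predicted == label)
    return correct / len(pairs)


# Toy embeddings standing in for the output of any sticker encoder.
toy_pairs = [
    ([1.0, 0.0], [0.9, 0.1], 1),  # similar, embeddings nearly aligned
    ([1.0, 0.0], [0.0, 1.0], 0),  # dissimilar, embeddings orthogonal
    ([0.5, 0.5], [0.6, 0.4], 1),  # similar, embeddings nearly aligned
    ([1.0, 0.2], [0.8, 0.6], 0),  # dissimilar, but embeddings look close
]

if __name__ == "__main__":
    print(pair_accuracy(toy_pairs))
```

The last toy pair is labeled dissimilar even though its embeddings point in nearly the same direction, so the sketch misclassifies it. This is the kind of error the abstract attributes to general-purpose encoders, which can place stickers that look alike close together even when their meanings differ.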