GSE: Evaluating Sticker Visual Semantic Similarity via a General Sticker Encoder

📅 2025-11-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Sticker semantic similarity assessment is difficult because sticker content is highly diverse and heavily symbolic, and because no standardized benchmark or specialized model has existed. This paper formally defines the sticker semantic similarity task, introduces Triple-S, the first human-annotated benchmark for the task (905 positive and negative sticker pairs), and proposes the lightweight General Sticker Encoder (GSE). GSE is a Transformer-based architecture trained via multi-stage contrastive learning on Triple-S and additional sticker datasets, enabling robust modeling of symbolic semantics. Experiments show that GSE produces better semantic embeddings for unseen stickers than general-purpose pretrained vision and multimodal models, and that it performs strongly on downstream tasks such as emotion classification and sticker-to-sticker retrieval. With few parameters and efficient inference, GSE serves as a deployable, general-purpose encoder for sticker understanding, and Triple-S supplies a standardized way to evaluate it.

📝 Abstract

Stickers have become a popular form of visual communication, yet understanding their semantic relationships remains challenging due to their highly diverse and symbolic content. In this work, we formally define the Sticker Semantic Similarity task and introduce Triple-S, the first benchmark for this task, consisting of 905 human-annotated positive and negative sticker pairs. Through extensive evaluation, we show that existing pretrained vision and multimodal models struggle to capture nuanced sticker semantics. To address this, we propose the General Sticker Encoder (GSE), a lightweight and versatile model that learns robust sticker embeddings using both Triple-S and additional datasets. GSE achieves superior performance on unseen stickers and demonstrates strong results on downstream tasks such as emotion classification and sticker-to-sticker retrieval. By releasing both Triple-S and GSE, we provide standardized evaluation tools and robust embeddings, enabling future research in sticker understanding, retrieval, and multimodal content generation. The Triple-S benchmark and GSE have been publicly released.
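
To make the pair-based setup concrete, here is a minimal sketch of how an encoder could be scored on labeled positive and negative sticker pairs. It is an illustration under stated assumptions, not the paper's evaluation protocol: the abstract does not specify the metric, so cosine similarity with a fixed decision threshold is assumed here, and the names cosine_similarity, pair_accuracy, and toy_pairs, along with the toy embedding values, are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def pair_accuracy(pairs, threshold=0.5):
    """Fraction of pairs classified correctly.

    pairs: list of (emb_a, emb_b, label), label 1 = similar, 0 = dissimilar.
    threshold: assumed decision boundary on cosine similarity (hypothetical).
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = 0
    for emb_a, emb_b, label in pairs:
        predicted = 1 if cosine_similarity(emb_a, emb_b) >= threshold else 0
        correct += int(predicted == label)
    return correct / len(pairs)


# Toy 2-D embeddings standing in for encoder outputs (made-up values).
toy_pairs = [
    ([1.0, 0.0], [0.9, 0.1], 1),
    ([1.0, 0.0], [0.0, 1.0], 0),
    ([0.5, 0.5], [0.6, 0.4], 1),
    ([0.2, 0.9], [0.9, 0.1], 0),
]
accuracy = pair_accuracy(toy_pairs, threshold=0.5)
```

In practice the embeddings would come from the encoder under test, and the threshold would be tuned on held-out pairs or replaced by a threshold-free metric such as ROC AUC.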

Problem

Research questions and friction points this paper is trying to address.

Defining the sticker semantic similarity task

Creating the first benchmark dataset for sticker similarity assessment

Developing a lightweight model that produces robust sticker embeddings

Innovation

Methods, ideas, or system contributions that make the work stand out.

Defines the Sticker Semantic Similarity task

Introduces the Triple-S benchmark with human annotations

Proposes the lightweight General Sticker Encoder (GSE) model

🔎 Similar Papers

Impact of Stickers on Multimodal Chat Sentiment Analysis and Intent Recognition: A New Task, Dataset and Baseline