🤖 AI Summary
Existing sticker retrieval methods predominantly rely on contextual generation, overlooking stickers’ capacity as independent semantic units—e.g., direct responses or semantic completions. To address this, we introduce StickerInt, the first open-domain conversational sticker retrieval dataset supporting both “sticker-as-reply” and “sticker-as-semantic-completion” paradigms, and we are the first to formally define and model the task of “sticker as standalone reply.” Furthermore, we propose Int-RA, a knowledge-enhanced intent prediction and relation-aware cross-modal selection model that jointly encodes dialogue intent and multimodal (text-image) semantics. On StickerInt, Int-RA significantly outperforms state-of-the-art methods. We publicly release the dataset and code to advance sticker retrieval toward more natural, holistic human–machine interaction.
📝 Abstract
Using stickers in online chatting is very prevalent on social media platforms, where the stickers used in a conversation can express someone's intention, emotion, or attitude in a vivid, tactful, and intuitive way. Existing sticker retrieval research typically retrieves stickers based on the context and the current utterance delivered by the user; that is, the stickers serve as a supplement to the current utterance. However, in real-world scenarios, stickers are also used on their own to express what we want to say, rather than merely supplementing our words. Therefore, in this paper, we create a new dataset for sticker retrieval in conversation, called **StickerInt**, where stickers are used either to reply to the previous conversation or to supplement our words. (We believe that the release of this dataset will provide a more complete paradigm than existing work for research on sticker retrieval in open-domain online conversation.) Based on the created dataset, we present a simple yet effective framework for sticker retrieval in conversation that learns the user's intention and the cross-modal relationships between conversation context and stickers, coined **Int-RA**. Specifically, we first devise a knowledge-enhanced intention predictor to introduce intention information into the conversation representations. Subsequently, a relation-aware sticker selector is devised to retrieve the response sticker via cross-modal relationships. Extensive experiments on the created dataset show that the proposed model achieves state-of-the-art performance in sticker retrieval. (The dataset and source code of this work are released at https://github.com/HITSZ-HLT/Int-RA.)