🤖 AI Summary
This study addresses the automatic identification of fine-grained, positive supportive language, termed "candy speech," in social media. We propose a span-level fine-tuning framework for multilingual modeling, trained on a corpus of 46k German YouTube comments. Our approach integrates representations from XLM-RoBERTa-Large, GBERT, and Qwen3 Embedding, and employs an emoji-aware tokenizer to enhance affective and pragmatic modeling. Unlike conventional sentence-level classification, span-level training enables precise localization of supportive linguistic segments. Evaluated on the GermEval 2025 shared task, our system ranked first in both subtasks, achieving a binary positive F1-score of 0.8906 and a strict span-level F1-score of 0.6307 for categorized supportive spans. These results demonstrate the effectiveness of cross-lingual representation learning combined with fine-grained span annotation for analyzing civil online discourse.
📝 Abstract
Positive, supportive online communication in social media (candy speech) has the potential to foster civility, yet automated detection of such language remains underexplored, limiting systematic analysis of its impact. We investigate how candy speech can be reliably detected in a 46k-comment German YouTube corpus by monolingual and multilingual language models, including GBERT, Qwen3 Embedding, and XLM-RoBERTa. We find that a multilingual XLM-RoBERTa-Large model trained to detect candy speech at the span level outperforms other approaches, ranking first in both the binary detection (positive F1: 0.8906) and categorized span-based detection (strict F1: 0.6307) subtasks at the GermEval 2025 Shared Task on Candy Speech Detection. We speculate that span-based training, multilingual capabilities, and emoji-aware tokenization improved detection performance. Our results demonstrate the effectiveness of multilingual models in identifying positive, supportive language.
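Strict span-level F1 presupposes decoding token-level model predictions into labeled spans and comparing them exactly against gold annotations. A minimal sketch of that decoding and scoring step, assuming a BIO tagging scheme and illustrative category names such as "compliment" and "gratitude" (the shared task's actual label set and evaluation script may differ):

```python
# Sketch only: assumes BIO labels per token; category names are illustrative,
# not necessarily the shared task's official label set.

def bio_to_spans(labels):
    """Decode BIO labels into (category, start, end) spans, end-exclusive."""
    spans = []
    start, cat = None, None
    for i, lab in enumerate(labels):
        if lab.startswith("B-"):
            if start is not None:          # close any open span first
                spans.append((cat, start, i))
            start, cat = i, lab[2:]
        elif lab.startswith("I-") and cat == lab[2:]:
            continue                        # span continues
        else:                               # "O" or a mismatched I- tag
            if start is not None:
                spans.append((cat, start, i))
            start, cat = None, None
    if start is not None:                   # span running to the end
        spans.append((cat, start, len(labels)))
    return spans

def strict_span_f1(gold, pred):
    """F1 where a predicted span counts only on exact (category, boundary) match."""
    gold_set, pred_set = set(gold), set(pred)
    if not gold_set or not pred_set:
        return 0.0
    tp = len(gold_set & pred_set)
    p, r = tp / len(pred_set), tp / len(gold_set)
    return 2 * p * r / (p + r) if p + r else 0.0

labels = ["O", "B-compliment", "I-compliment", "O", "B-gratitude"]
print(bio_to_spans(labels))  # [('compliment', 1, 3), ('gratitude', 4, 5)]
```

Under this strict criterion a span that is off by even one token scores zero, which is why the categorized-span F1 (0.6307) sits well below the binary comment-level F1 (0.8906).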