🤖 AI Summary
Existing benchmarks inadequately evaluate large language models (LLMs) for detecting extreme partisanship, fake news, harmful tweets, and political bias across multilingual and multitask settings—particularly regarding model scale, prompting strategies, and language coverage.
Method: We conduct the first systematic, unified evaluation of in-context learning (zero-shot, few-shot, chain-of-thought) versus parameter-efficient fine-tuning across ten cross-lingual datasets spanning five languages.
Contribution/Results: Fine-tuning consistently outperforms in-context learning, even for large models such as Llama-3.1-8B-Instruct, and this advantage persists regardless of model scale. Notably, small fine-tuned models surpass larger ones under in-context inference. Our work establishes the general superiority of fine-tuning for content safety detection and introduces the first large-scale, multilingual, multi-paradigm benchmark framework for evaluating LLMs on politically sensitive content moderation.
📝 Abstract
The spread of fake news and of polarizing, politically biased, and harmful content on online platforms is a serious concern. Large language models have emerged as a promising approach to detecting such content, yet no study has properly benchmarked their performance across different models, adaptation methods, and languages. This study presents a comprehensive overview of different large language model adaptation paradigms for the detection of hyperpartisan and fake news, harmful tweets, and political bias. Our experiments span 10 datasets and 5 languages (English, Spanish, Portuguese, Arabic, and Bulgarian), covering both binary and multiclass classification scenarios. We tested strategies ranging from parameter-efficient fine-tuning of language models to a variety of in-context learning strategies and prompts, including zero-shot prompts, codebooks, few-shot prompting (with both randomly selected and diversely selected examples, the latter chosen via Determinantal Point Processes), and chain-of-thought. We found that in-context learning often underperforms fine-tuning. This finding highlights the value of fine-tuning even smaller models on task-specific data, compared with the largest models we evaluated in an in-context learning setup: Llama-3.1-8B-Instruct, Mistral-Nemo-Instruct-2407, and Qwen2.5-7B-Instruct.
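The abstract does not spell out how the diversely selected few-shot examples are drawn. A common way to approximate Determinantal Point Process selection is a greedy MAP procedure over a similarity kernel built from example embeddings; the sketch below (function name and linear kernel are our assumptions, not the paper's implementation) illustrates the idea:

```python
import numpy as np

def select_diverse_examples(embeddings, k):
    """Greedy MAP approximation of DPP selection (a sketch, not the
    paper's exact method). At each step, pick the candidate with the
    largest residual "novelty" with respect to the items chosen so far.
    """
    # Linear-kernel similarity matrix; rows of `embeddings` are examples.
    L = embeddings @ embeddings.T
    n = L.shape[0]
    selected = []
    # di2[i]: residual novelty of item i (starts at its self-similarity)
    di2 = np.diag(L).astype(float).copy()
    # c[j, i]: Cholesky-style coefficient of item i on the j-th pick
    c = np.zeros((k, n))
    for step in range(k):
        i = int(np.argmax(di2))
        selected.append(i)
        di = np.sqrt(di2[i])
        if di < 1e-10:  # remaining items are redundant; stop early
            break
        # Update novelty of all items given the new pick
        e = (L[i] - c[:step].T @ c[:step, i]) / di
        c[step] = e
        di2 = di2 - e ** 2
        di2[i] = -np.inf  # never re-select the same example
    return selected
```

With near-duplicate examples in the pool, the procedure skips the duplicate in favor of a dissimilar one, which is the behavior that motivates DPP-based selection over random sampling.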