🤖 AI Summary
Existing benchmarks inadequately evaluate large language models (LLMs) for detecting extreme partisanship, fake news, harmful tweets, and political bias across multilingual and multitask settings—particularly regarding model scale, prompting strategies, and language coverage.
Method: We conduct the first systematic, unified evaluation of in-context learning (zero-shot, few-shot, chain-of-thought) versus parameter-efficient fine-tuning across ten cross-lingual datasets spanning five languages.
Contribution/Results: Fine-tuning consistently outperforms in-context learning, even for large models such as Llama-3.1-8B-Instruct, and this advantage persists regardless of model scale. Notably, small fine-tuned models surpass larger ones under in-context inference. Our work establishes the general superiority of fine-tuning for content safety detection and introduces the first large-scale, multilingual, multi-paradigm benchmark framework for evaluating LLMs on politically sensitive content moderation.
📝 Abstract
The spread of fake news and of polarizing, politically biased, and harmful content on online platforms is a serious concern. Large language models have emerged as a promising approach to detecting such content, yet no study has properly benchmarked their performance across different models, adaptation methods, and languages. This study presents a comprehensive overview of different large language model adaptation paradigms for the detection of hyperpartisan and fake news, harmful tweets, and political bias. Our experiments span 10 datasets and 5 languages (English, Spanish, Portuguese, Arabic, and Bulgarian), covering both binary and multiclass classification scenarios. We tested strategies ranging from parameter-efficient fine-tuning of language models to a variety of in-context learning strategies and prompts, including zero-shot prompts, codebooks, few-shot prompting (with both randomly selected and diversely selected examples, the latter chosen via Determinantal Point Processes), and chain-of-thought. We found that in-context learning often underperforms fine-tuning. This finding highlights the value of fine-tuning even smaller models on task-specific data, compared with the largest models we evaluated in an in-context learning setup: Llama-3.1-8B-Instruct, Mistral-Nemo-Instruct-2407, and Qwen2.5-7B-Instruct.
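The abstract does not spell out how the diversely selected few-shot examples are drawn. A common way to approximate Determinantal Point Process selection is a greedy MAP procedure over a similarity kernel built from example embeddings; the sketch below (function name and linear kernel are our assumptions, not the paper's implementation) illustrates the idea:

```python
import numpy as np

def select_diverse_examples(embeddings, k):
    """Greedy MAP approximation of DPP selection (a sketch, not the
    paper's exact method). At each step, pick the candidate with the
    largest residual "novelty" with respect to the items chosen so far.
    """
    # Linear-kernel similarity matrix; rows of `embeddings` are examples.
    L = embeddings @ embeddings.T
    n = L.shape[0]
    selected = []
    # di2[i]: residual novelty of item i (starts at its self-similarity)
    di2 = np.diag(L).astype(float).copy()
    # c[j, i]: Cholesky-style coefficient of item i on the j-th pick
    c = np.zeros((k, n))
    for step in range(k):
        i = int(np.argmax(di2))
        selected.append(i)
        di = np.sqrt(di2[i])
        if di < 1e-10:  # remaining items are redundant; stop early
            break
        # Update novelty of all items given the new pick
        e = (L[i] - c[:step].T @ c[:step, i]) / di
        c[step] = e
        di2 = di2 - e ** 2
        di2[i] = -np.inf  # never re-select the same example
    return selected
```

With near-duplicate examples in the pool, the procedure skips the duplicate in favor of a dissimilar one, which is the behavior that motivates DPP-based selection over random sampling.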