🤖 AI Summary
Static hate speech benchmarks are temporally fragile: they ignore how language evolves, which can severely distort safety evaluations of language models. This paper presents the first systematic empirical investigation of the issue: we construct a dynamic test suite from longitudinal, cross-year data and run robustness assessments on 20 state-of-the-art language models, quantifying both performance degradation and bias drift under temporal out-of-distribution evaluation. All models exhibit an average 18.7% drop in F1 score on temporal extrapolation tasks, confirming that static benchmarks substantially overestimate real-world safety. Building on these findings, we propose principled guidelines for time-sensitive benchmark design, emphasizing data recency, sociocultural context coverage, and mechanisms for continuous updating, thereby establishing a methodological foundation for trustworthy hate speech detection evaluation.
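As a rough illustration of what a temporal out-of-distribution evaluation looks like, the sketch below scores a fixed classifier on a held-out test set from its training year and on a test set from a later year, then reports the relative F1 drop. The `predict_labels` callable and the year-keyed data structure are hypothetical placeholders, not the paper's actual pipeline.

```python
from sklearn.metrics import f1_score

def relative_f1_drop(model, data_by_year, train_year, future_year, predict_labels):
    """Quantify temporal degradation: F1 on same-year vs. later-year test data.

    data_by_year: dict mapping year -> (texts, gold_labels); illustrative format.
    predict_labels: callable(model, texts) -> predicted labels; hypothetical.
    """
    # In-distribution evaluation: held-out test data from the training year.
    texts_id, gold_id = data_by_year[train_year]
    f1_id = f1_score(gold_id, predict_labels(model, texts_id), average="macro")

    # Temporal out-of-distribution evaluation: test data from a later year.
    texts_ood, gold_ood = data_by_year[future_year]
    f1_ood = f1_score(gold_ood, predict_labels(model, texts_ood), average="macro")

    # Relative drop; e.g. 0.187 would correspond to the 18.7% figure above.
    return (f1_id - f1_ood) / f1_id
```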
📝 Abstract
Language changes over time, including in the hate speech domain, which evolves quickly in response to social dynamics and cultural shifts. While NLP research has investigated the impact of language evolution on model training and has proposed several solutions to mitigate it, its impact on model benchmarking remains under-explored. Yet hate speech benchmarks play a crucial role in ensuring model safety. In this paper, we empirically evaluate the robustness of 20 language models across two evolving hate speech experiments, and we show the temporal misalignment between static and time-sensitive evaluations. Our findings call for time-sensitive linguistic benchmarks to evaluate language models correctly and reliably in the hate speech domain.
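To make the static-versus-time-sensitive contrast concrete, here is a minimal sketch that evaluates the same corpus under two split protocols: a static (randomly shuffled) split and a chronological split that trains on the past and tests on the future. The gap between the two scores is one way to operationalize the temporal misalignment the abstract describes. The column names (`text`, `label`, `year`) and the `evaluate` callable are illustrative assumptions, not the paper's setup.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def static_vs_temporal_gap(df: pd.DataFrame, evaluate, cutoff_year: int):
    """Compare a static (random) split with a time-sensitive (chronological) one.

    df: corpus with 'text', 'label', 'year' columns (illustrative schema).
    evaluate: callable(train_df, test_df) -> F1 score; hypothetical train/score step.
    """
    # Static evaluation: random shuffle, so train and test share time periods.
    train_s, test_s = train_test_split(
        df, test_size=0.2, random_state=0, stratify=df["label"]
    )
    f1_static = evaluate(train_s, test_s)

    # Time-sensitive evaluation: train strictly on the past, test on the future.
    train_t = df[df["year"] <= cutoff_year]
    test_t = df[df["year"] > cutoff_year]
    f1_temporal = evaluate(train_t, test_t)

    # A positive gap means the static benchmark overestimates robustness.
    return f1_static - f1_temporal
```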