🤖 AI Summary
This paper identifies severe construct validity deficiencies in the HellaSwag benchmark (grammatical errors, misleading prompts, ambiguous answer options, and spurious statistical shortcuts) that undermine its reliability for evaluating language models' commonsense reasoning. Through ablation studies across multiple LLM scales, answer-text-only evaluation, controlled "Lorem ipsum" prompt baselines, and fine-grained human annotation, the authors quantitatively demonstrate that over 65% of model predictions are context-agnostic, leaving evaluations highly susceptible to superficial surface patterns. They propose essential design principles for next-generation commonsense reasoning benchmarks and open-source GoldenSwag, a rigorously curated subset that addresses these flaws, enabling more trustworthy model selection and capability attribution.
📝 Abstract
Common-sense reasoning is a key language model capability because it encapsulates not only specific factual knowledge but general language and world understanding. Measuring common-sense reasoning is therefore crucial for language models of different sizes and applications. One of the most widely used benchmarks for evaluating such capabilities is HellaSwag; in this paper, however, we show that it has severe construct validity issues. These issues range from basic ungrammaticality and numerous typos to misleading prompts and equally correct options. Furthermore, we show that if models are evaluated only on the answer texts, or with "Lorem ipsum dolor..." in place of the question, more than 65% of model predictions remain the same, and this cannot be attributed merely to contamination. Since benchmark scores are an essential part of model selection in both research and commercial applications, and since scores are routinely taken at face value, such validity issues can lead to ill-informed decisions about models. In this paper, we thoroughly investigate the critical validity issues posed by HellaSwag and illustrate them with evaluations of generative language models of different sizes. We argue that this benchmark does not accurately measure common-sense reasoning and therefore should not be used for evaluation in its current state. Based on the results of our study, we propose requirements that future common-sense reasoning benchmarks should meet. In addition, we release GoldenSwag, a corrected subset of HellaSwag, which we believe enables sound common-sense reasoning evaluation.
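The context-ablation test described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: `loglik` stands in for any model scoring function (e.g. length-normalized summed token log-probabilities from an LLM), and the `LOREM` placeholder and helper names are assumptions for the example.

```python
# Hypothetical sketch of the context-ablation test from the abstract:
# score each candidate ending under (a) the real context and (b) a
# "Lorem ipsum" placeholder, then measure how often the argmax
# prediction stays the same.

LOREM = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."

def pick_ending(context, endings, loglik):
    """Return the index of the highest-scoring ending.

    `loglik(prompt, continuation)` is any callable returning the model's
    score for `continuation` given `prompt` (illustrative placeholder
    for a real LLM log-likelihood).
    """
    scores = [loglik(context, ending) for ending in endings]
    return max(range(len(endings)), key=scores.__getitem__)

def context_agnostic_rate(items, loglik):
    """Fraction of items whose prediction is unchanged when the real
    context is replaced by the Lorem-ipsum placeholder."""
    same = 0
    for context, endings in items:
        with_ctx = pick_ending(context, endings, loglik)
        without_ctx = pick_ending(LOREM, endings, loglik)
        same += with_ctx == without_ctx
    return same / len(items)
```

With a real model scorer plugged in, the paper reports this rate exceeding 65% on HellaSwag; a benchmark that actually probes reasoning over the context should drive it much lower.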