🤖 AI Summary
This work investigates whether large language models (LLMs) genuinely understand commonsense knowledge or merely rely on superficial pattern memorization. Method: We introduce HellaSwag-Pro, the first bilingual, multi-variant robustness benchmark for commonsense reasoning, comprising 11,200 English samples spanning seven variant types, together with a meticulously annotated Chinese counterpart (12,000 samples, 56 fine-grained categories). A two-stage construction pipeline integrates human verification, rule-based variant generation, bilingual alignment, and fine-grained annotation. We conduct the first systematic robustness evaluation of 41 mainstream LLMs across variant types and languages. Contribution/Results: Experiments reveal pervasive fragility in LLMs' commonsense reasoning: performance is highly sensitive to linguistic form, and significant disparities exist between English and Chinese results. The benchmark is publicly released, establishing a reproducible, highly discriminative standard for robustness evaluation.
📝 Abstract
Large language models (LLMs) have shown remarkable capabilities in commonsense reasoning; however, slight variations in question phrasing can trigger incorrect responses. Do these models truly understand commonsense knowledge, or do they merely memorize expression patterns? To investigate this question, we present the first extensive robustness evaluation of LLMs in commonsense reasoning. We introduce HellaSwag-Pro, a large-scale bilingual benchmark of 11,200 cases, constructed by designing and compiling seven types of question variants. To support this benchmark, we propose a two-stage method to develop Chinese HellaSwag, a finely annotated dataset comprising 12,000 instances across 56 categories. We conduct extensive experiments on 41 representative LLMs, revealing that these LLMs are far from robust in commonsense reasoning. Furthermore, this robustness varies with the language in which the LLM is tested. This work establishes a high-quality evaluation benchmark, and its extensive experiments offer the community valuable insights into the commonsense reasoning abilities of LLMs.