🤖 AI Summary
This study challenges the prevailing claim that large language models (LLMs) exhibit weaker English comprehension than humans, particularly for low-complexity sentences.
Method: Using a preregistered behavioral experiment and log-probability analysis under a naturalistic, first-pass-only reading paradigm, we compare human participants with Falcon-180B-Chat, GPT-4, GPT-4o, GPT-o1, and Llama-2-70B on grammaticality judgment tasks.
Contribution/Results: With rereading restricted, humans achieved only 73% accuracy, whereas GPT-o1 attained 100%, and GPT-4 (81%) and Falcon-180B-Chat (76%) outperformed humans. Results reveal shared pragmatic sensitivity between humans and LLMs, contradicting assumptions of inherent model deficits, and demonstrate that prompt framing systematically modulates whether GPT-4o aligns with naive or expert grammaticality judgments. The study recalibrates evaluation benchmarks for LLM linguistic competence and critically interrogates anthropocentric assumptions about language understanding.
📝 Abstract
Recent claims suggest that large language models (LLMs) underperform humans in comprehending minimally complex English statements (Dentella et al., 2024). Here, we revisit those findings and argue that human performance was overestimated, while LLM abilities were underestimated. Using the same stimuli, we report a preregistered study comparing human responses in two conditions: one allowed rereading (replicating the original study), and one restricted rereading (a more naturalistic comprehension test). Human accuracy dropped significantly when rereading was restricted (73%), falling below that of Falcon-180B-Chat (76%) and GPT-4 (81%). The newer GPT-o1 model achieves perfect accuracy. Results further show that both humans and models are disproportionately challenged by queries involving potentially reciprocal actions (e.g., kissing), suggesting shared pragmatic sensitivities rather than model-specific deficits. Additional analyses using Llama-2-70B log probabilities, a recoding of open-ended model responses, and grammaticality ratings of other sentences reveal systematic underestimation of model performance. We find that GPT-4o can align with either naive or expert grammaticality judgments, depending on prompt framing. These findings underscore the need for more careful experimental design and coding practices in LLM evaluation, and they challenge the assumption that current models are inherently weaker than humans at language comprehension.
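The log-probability analysis mentioned above scores a sentence by summing the token-level log probabilities a causal LM assigns to it, so that a higher (less negative) total indicates the model prefers that string. A minimal sketch of the scoring logic, using hypothetical placeholder probabilities rather than real Llama-2-70B outputs:

```python
import math

def sentence_log_prob(token_probs):
    """Score a sentence by summing token-level log probabilities.

    token_probs: one conditional probability P(token_i | tokens_<i) per token,
    as a causal LM would assign them (placeholder values here, not real model output).
    """
    return sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities for a grammatical sentence and an
# ungrammatical variant; the ungrammatical one gets low probability where
# the grammar breaks down.
grammatical = [0.9, 0.8, 0.85, 0.9]
ungrammatical = [0.9, 0.8, 0.1, 0.3]

# A higher total log probability means the model prefers that string,
# which is how log probabilities serve as implicit grammaticality judgments.
assert sentence_log_prob(grammatical) > sentence_log_prob(ungrammatical)
```

In practice the per-token probabilities would come from the model's output distribution over its vocabulary at each position; the comparison between minimal-pair sentences is what stands in for an explicit grammaticality judgment.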