🤖 AI Summary
This study challenges the prevailing claim that large language models (LLMs) exhibit weaker English comprehension than humans, particularly for low-complexity sentences.
Method: Using a preregistered behavioral experiment and log-probability analysis under a naturalistic, first-pass-only reading paradigm, we compare human participants with Falcon-180B-Chat, GPT-4, GPT-4o, GPT-o1, and Llama-2-70B on grammaticality judgment tasks.
Contribution/Results: With rereading restricted, humans achieved only 73% accuracy, whereas GPT-o1 attained 100%, and GPT-4 (81%) and Falcon-180B-Chat (76%) outperformed humans. Results reveal shared pragmatic sensitivity between humans and LLMs, contradicting assumptions of inherent model deficits, and demonstrate that prompt framing systematically modulates whether GPT-4o aligns with naive or expert grammaticality judgments. The study recalibrates evaluation benchmarks for LLM linguistic competence and critically interrogates anthropocentric assumptions about language understanding.
📝 Abstract
Recent claims suggest that large language models (LLMs) underperform humans in comprehending minimally complex English statements (Dentella et al., 2024). Here, we revisit those findings and argue that human performance was overestimated, while LLM abilities were underestimated. Using the same stimuli, we report a preregistered study comparing human responses in two conditions: one allowed rereading (replicating the original study), and one restricted rereading (a more naturalistic comprehension test). Human accuracy dropped significantly when rereading was restricted (73%), falling below that of Falcon-180B-Chat (76%) and GPT-4 (81%). The newer GPT-o1 model achieves perfect accuracy. Results further show that both humans and models are disproportionately challenged by queries involving potentially reciprocal actions (e.g., kissing), suggesting shared pragmatic sensitivities rather than model-specific deficits. Additional analyses using Llama-2-70B log probabilities, a recoding of open-ended model responses, and grammaticality ratings of other sentences reveal systematic underestimation of model performance. We find that GPT-4o can align with either naive or expert grammaticality judgments, depending on prompt framing. These findings underscore the need for more careful experimental design and coding practices in LLM evaluation, and they challenge the assumption that current models are inherently weaker than humans at language comprehension.
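The log-probability analysis mentioned above scores a sentence by summing the token-level log probabilities a causal LM assigns to it, so that a higher (less negative) total indicates the model prefers that string. A minimal sketch of the scoring logic, using hypothetical placeholder probabilities rather than real Llama-2-70B outputs:

```python
import math

def sentence_log_prob(token_probs):
    """Score a sentence by summing token-level log probabilities.

    token_probs: one conditional probability P(token_i | tokens_<i) per token,
    as a causal LM would assign them (placeholder values here, not real model output).
    """
    return sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities for a grammatical sentence and an
# ungrammatical variant; the ungrammatical one gets low probability where
# the grammar breaks down.
grammatical = [0.9, 0.8, 0.85, 0.9]
ungrammatical = [0.9, 0.8, 0.1, 0.3]

# A higher total log probability means the model prefers that string,
# which is how log probabilities serve as implicit grammaticality judgments.
assert sentence_log_prob(grammatical) > sentence_log_prob(ungrammatical)
```

In practice the per-token probabilities would come from the model's output distribution over its vocabulary at each position; the comparison between minimal-pair sentences is what stands in for an explicit grammaticality judgment.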