Why are language models less surprised than humans? Testing the Parse Multiplicity Mismatch Hypothesis

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This study investigates why language models systematically underestimate the cognitive load that syntactic ambiguities, such as garden-path sentences, induce during human reading. By varying the number of parallel parses maintained during word-synchronous beam search in a recurrent neural network grammar (RNNG), the authors simulate surprisal under different parsing capacities and use these estimates to predict eye-tracking reading times. This approach offers the first computational test of the “parse multiplicity mismatch hypothesis.” Results show that reducing the number of parallel parses amplifies the model’s predicted garden-path effects, yet the predicted effects remain substantially smaller than those observed in human reading, suggesting that differences in parse multiplicity alone cannot account for the mismatch between model surprisal and human processing difficulty.

📝 Abstract

Surprisal theory posits that the processing difficulty of a word is determined by its predictability in context, offering a potential link between human sentence processing and next-word predictions from language models. While language model (LM) surprisals successfully predict reading times in naturalistic text, they systematically underpredict the magnitude of difficulty observed in controlled studies of syntactic ambiguity, particularly in garden path sentences. This mismatch might arise from differences in the computational constraints on humans and LMs. Here we test one such hypothesis: that LMs may be able to consider a greater number of distinct sentence interpretations at once than humans can. Using Recurrent Neural Network Grammars (RNNGs) with word-synchronous beam search, we systematically vary the number of simultaneous parses used to compute word surprisal, and then use these surprisals to predict human reading times. Reducing the number of simultaneously active parses indeed increases the magnitude of predicted garden path effects, but not nearly enough to capture the full magnitude of the effects in humans. This suggests that differences in the number of simultaneous parses available to LMs and humans cannot reconcile LM-based surprisal with human sentence processing.
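To make the manipulation concrete, here is a minimal Python sketch of how word surprisal can be estimated from a beam of partial parses, and why a narrower beam can raise surprisal at a disambiguating word. It is an illustration under stated assumptions, not the authors' implementation: the function name `beam_surprisal`, the two-parse setup, and all probabilities are hypothetical, and a real word-synchronous beam search over an RNNG expands and prunes many action sequences at every word.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def beam_surprisal(prev_logps, step_logps, beam_size):
    """Surprisal (in bits) of the next word, estimated from a beam of parses.

    prev_logps: joint log-probability of each candidate parse with the prefix.
    step_logps: log-probability of the next word under each candidate parse.
    beam_size:  number of parses retained before the next word is read.
    All names and the pruning rule are illustrative assumptions.
    """
    # Keep only the beam_size most probable parses of the prefix.
    order = sorted(range(len(prev_logps)), key=lambda i: prev_logps[i], reverse=True)
    kept = order[:beam_size]
    # Approximate prefix probabilities by summing over the retained parses.
    before = logsumexp([prev_logps[i] for i in kept])
    after = logsumexp([prev_logps[i] + step_logps[i] for i in kept])
    # Surprisal = -log2 p(word | prefix), converted from nats to bits.
    return (before - after) / math.log(2)


# Made-up numbers for a garden-path prefix with two candidate parses:
# index 0 = preferred parse, index 1 = dispreferred but ultimately correct parse.
prev_logps = [math.log(0.9), math.log(0.1)]
# The disambiguating word is very unlikely under the preferred parse.
step_logps = [math.log(0.001), math.log(0.5)]

wide = beam_surprisal(prev_logps, step_logps, beam_size=2)    # both parses kept
narrow = beam_surprisal(prev_logps, step_logps, beam_size=1)  # correct parse pruned
print(f"beam=2: {wide:.2f} bits, beam=1: {narrow:.2f} bits")
```

With these toy numbers the wide beam gives roughly 4.3 bits and the narrow beam roughly 10.0 bits, because the narrow beam has already discarded the parse under which the disambiguating word is likely. The abstract reports the same qualitative direction, while noting that the increase is not nearly large enough to match human garden path effects.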

Problem

Research questions and friction points this paper is trying to address.

language models

surprisal

garden path sentences

syntactic ambiguity

human sentence processing

Innovation

Methods, ideas, or system contributions that make the work stand out.

surprisal

garden path sentences

parse multiplicity