The Inverse Scaling Effect of Pre-Trained Language Model Surprisal Is Not Due to Data Leakage

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
In psycholinguistics, surprisal from larger pre-trained language models (PLMs) has been shown to be a poorer predictor of naturalistic human reading times, a phenomenon termed “inverse scaling.” Prior work speculated that this stems from leakage of the reading-time stimuli into PLM training data. Method: Two large-scale empirical studies: (1) a cross-corpus token n-gram overlap analysis quantifying how much of five naturalistic reading time corpora appears in two major pre-training datasets; and (2) training of controlled models on near-leakage-free data to re-assess the fit of surprisal to reading times across model sizes. Results: (1) overlap between the reading corpora and the pre-training datasets is relatively small; (2) the inverse scaling effect replicates even with models trained on minimally overlapping data. These findings undermine the data-leakage hypothesis and indicate that the effect is not an artifact of contamination, clarifying its interpretation and offering a methodological point of reference for psycholinguistic model evaluation.
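
To make the overlap analysis concrete, below is a minimal sketch of a token n-gram overlap check. The whitespace tokenization and the file names stimuli.txt and pretraining.txt are illustrative assumptions; the paper's analysis runs over the actual reading corpora and full pre-training datasets with their own tokenization.

```python
# A minimal sketch of a cross-corpus token n-gram overlap check, assuming
# whitespace tokenization and small plain-text placeholder files
# ("stimuli.txt", "pretraining.txt"); the paper's analysis uses the actual
# corpora and full pre-training datasets with their own tokenization.

def token_ngrams(tokens, n):
    """Yield every contiguous n-gram of tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def overlap_rate(stimulus_tokens, pretrain_tokens, n):
    """Fraction of stimulus n-grams that also occur in the pre-training text."""
    pretrain_set = set(token_ngrams(pretrain_tokens, n))
    stimulus = list(token_ngrams(stimulus_tokens, n))
    if not stimulus:
        return 0.0
    return sum(g in pretrain_set for g in stimulus) / len(stimulus)

if __name__ == "__main__":
    stimulus = open("stimuli.txt", encoding="utf-8").read().split()
    pretrain = open("pretraining.txt", encoding="utf-8").read().split()
    for n in (2, 4, 8):
        print(f"{n}-gram overlap: {overlap_rate(stimulus, pretrain, n):.4f}")
```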

📝 Abstract
In psycholinguistic modeling, surprisal from larger pre-trained language models has been shown to be a poorer predictor of naturalistic human reading times. However, it has been speculated that this may be due to data leakage that caused language models to see the text stimuli during training. This paper presents two studies to address this concern at scale. The first study reveals relatively little leakage of five naturalistic reading time corpora in two pre-training datasets in terms of length and frequency of token $n$-gram overlap. The second study replicates the negative relationship between language model size and the fit of surprisal to reading times using models trained on 'leakage-free' data that overlaps only minimally with the reading time corpora. Taken together, this suggests that previous results using language models trained on these corpora are not driven by the effects of data leakage.
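The second study turns on how well per-token surprisal fits reading times. The sketch below is a rough stand-in, assuming GPT-2 via Hugging Face transformers in place of the paper's controlled models and a plain Pearson correlation in place of its regression-based fit measure; the reading times are placeholder values, not data from any of the five corpora.

```python
# A minimal sketch, assuming GPT-2 via Hugging Face transformers as a
# stand-in for the paper's controlled models. A plain Pearson correlation
# replaces the paper's regression-based fit measure, and the reading times
# below are placeholder values, not data from any of the five corpora.
import statistics

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_surprisals(text: str, model_name: str = "gpt2") -> list[float]:
    """Per-token surprisal in bits (-log2 p) under an autoregressive LM."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability of each token given its preceding context, in bits.
    log2_probs = torch.log_softmax(logits[0, :-1], dim=-1) / torch.log(torch.tensor(2.0))
    targets = ids[0, 1:]
    return (-log2_probs[torch.arange(len(targets)), targets]).tolist()

if __name__ == "__main__":
    surprisal = token_surprisals("The old man the boat after the storm.")
    # Hypothetical per-word reading times (ms); real analyses align subword
    # surprisals to words and fit (mixed-effects) regression models instead.
    reading_times = [210.0, 305.0, 450.0, 390.0, 280.0, 260.0, 240.0]
    n = min(len(surprisal), len(reading_times))
    r = statistics.correlation(surprisal[:n], reading_times[:n])  # Python 3.10+
    print(f"surprisal-reading-time correlation over {n} tokens: r = {r:.3f}")
```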
Problem

Research questions and friction points this paper is trying to address.

Investigates whether data leakage explains why surprisal from larger language models fits reading times more poorly
Quantifies token n-gram overlap between five reading time corpora and two pre-training datasets
Tests whether the inverse scaling effect persists for models trained on near-leakage-free data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantifies leakage of reading time corpora into large pre-training datasets at scale
Trains controlled models on 'leakage-free' data that overlaps only minimally with the reading time corpora
Replicates the inverse scaling effect under near-zero leakage