Problem
Research questions and friction points this paper is trying to address.
Investigates whether data leakage causes poor surprisal predictions in language models
Assesses n-gram overlap between reading corpora and pre-training data (see the overlap sketch after this list)
Tests how model size affects surprisal accuracy when models are trained on leakage-free data
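A minimal sketch of the kind of n-gram overlap check listed above, assuming whitespace tokenization and token-level coverage counting; the file names and the n-gram order are illustrative placeholders, not the paper's actual pipeline:

```python
from collections import Counter


def ngrams(tokens, n):
    # All contiguous n-grams of a token sequence, as tuples.
    return zip(*(tokens[i:] for i in range(n)))


def overlap_fraction(reading_tokens, pretraining_tokens, n=5):
    # Fraction of reading-corpus n-gram occurrences that also
    # appear somewhere in the pre-training data.
    pretraining_grams = set(ngrams(pretraining_tokens, n))
    reading_counts = Counter(ngrams(reading_tokens, n))
    total = sum(reading_counts.values())
    hits = sum(count for gram, count in reading_counts.items()
               if gram in pretraining_grams)
    return hits / total if total else 0.0


# Hypothetical corpus files; whitespace tokenization for simplicity.
reading = open("reading_corpus.txt").read().split()
pretraining = open("pretraining_sample.txt").read().split()
print(f"5-gram overlap: {overlap_fraction(reading, pretraining):.3f}")
```

A type-level (set-based) overlap rate is an equally plausible variant; which definition the paper uses is not specified here.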
Innovation
Methods, ideas, or system contributions that make the work stand out.
Analyzes data leakage in pre-training datasets
Uses leakage-free data for model training
Replicates the inverse scaling effect, whereby surprisal from larger models provides a worse fit to human reading data (see the surprisal sketch below)
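For context, a minimal sketch of per-token surprisal from a causal language model, the quantity whose fit to reading data is at issue; the Hugging Face transformers API and the "gpt2" checkpoint are assumptions for illustration, not necessarily the paper's models or training setup:

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def surprisals(text):
    # Surprisal in bits for every token after the first:
    # -log2 p(token | left context).
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    nll = -log_probs[torch.arange(targets.size(0)), targets]
    tokens = tokenizer.convert_ids_to_tokens(targets.tolist())
    return list(zip(tokens, (nll / math.log(2)).tolist()))


for tok, s in surprisals("The horse raced past the barn fell."):
    print(f"{tok:>12} {s:6.2f}")
```

Under the inverse scaling finding, swapping in a larger checkpoint would yield surprisal values that fit reading measures worse, not better.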