🤖 AI Summary
Language model (LM) probabilities are sensitive to sequence length and unigram frequency in ways that human acceptability judgments largely are not, so comparing the two requires an adjustment estimated from data rather than assumed in advance. Method: We propose MORCELA, a new linking theory between LM scores and acceptability judgments in which the degree of adjustment for length and unigram frequency is set by learned parameters. Using the Pythia and OPT families of Transformer LMs, we compare the fitted parameters across model sizes to quantify how strongly each model is affected by the two confounds. Contribution/Results: MORCELA outperforms the SLOR baseline across both model families, and the fixed adjustments assumed by SLOR overcorrect for length and unigram frequency. Larger models need a lower relative degree of adjustment for unigram frequency, which can be explained by their better prediction of rarer words in context, though a significant amount of adjustment is still necessary for all models. MORCELA thus offers an interpretable way to relate LM scores to human acceptability judgments.
📝 Abstract
When comparing the linguistic capabilities of language models (LMs) with those of humans using LM probabilities, factors such as the length of the sequence and the unigram frequency of lexical items have a significant effect on LM probabilities in ways that humans are largely robust to. Prior work comparing LM and human acceptability judgments treats these effects uniformly across models, making the strong assumption that all models require the same degree of adjustment to control for length and unigram frequency effects. We propose MORCELA, a new linking theory between LM scores and acceptability judgments in which the optimal level of adjustment for these effects is estimated from data via learned parameters for length and unigram frequency. We first show that MORCELA outperforms SLOR (Pauls and Klein, 2012; Lau et al., 2017), a commonly used linking theory for acceptability, across two families of transformer LMs (Pythia and OPT). Furthermore, we demonstrate that the degrees of adjustment for length and unigram frequency assumed in SLOR overcorrect for these confounds, and that larger models require a lower relative degree of adjustment for unigram frequency, though a significant amount of adjustment is still necessary for all models. Finally, our subsequent analysis shows that larger LMs' lower susceptibility to frequency effects can be explained by an ability to better predict rarer words in context.
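To make the contrast concrete, the sketch below computes SLOR as it is commonly defined (the LM log-probability minus the unigram log-probability, divided by sequence length) next to a parameterized variant. The variant is only an illustration of what "learned parameters for length and unigram frequency" could look like: the function name `parameterized_score`, the parameters `beta` and `gamma`, and the functional form are assumptions, not the formulation stated in the abstract.

```python
def slor(lm_logprob: float, unigram_logprob: float, length: int) -> float:
    """SLOR: (log p_LM(s) - log p_unigram(s)) / |s|.

    The unigram correction has a fixed weight of 1 and the length
    normalization has no additional offset.
    """
    return (lm_logprob - unigram_logprob) / length


def parameterized_score(
    lm_logprob: float,
    unigram_logprob: float,
    length: int,
    beta: float = 1.0,   # hypothetical weight on the unigram correction
    gamma: float = 0.0,  # hypothetical offset interacting with length
) -> float:
    """Illustrative generalization of SLOR with adjustable corrections.

    With beta=1 and gamma=0 this reduces exactly to SLOR. In a learned
    linking theory, values like beta and gamma would be fit to human
    acceptability judgments rather than fixed in advance.
    """
    return (lm_logprob - beta * unigram_logprob + gamma) / length


# Toy numbers (not from the paper): a 5-token sentence.
lm_lp, uni_lp, n = -20.0, -30.0, 5
baseline = slor(lm_lp, uni_lp, n)
weaker_correction = parameterized_score(lm_lp, uni_lp, n, beta=0.5)
```

In this toy setting, a `beta` below 1 applies less unigram-frequency correction than SLOR does, which is the direction the abstract reports for larger models; how the parameters are actually fit is left out of the sketch.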