On the Proper Treatment of Units in Surprisal Theory

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a recurring source of ambiguity in surprisal theory: the linguistic unit is often left underspecified, and the units used in experiments (e.g., words) typically do not align with the token alphabet over which pretrained language models assign probability. The authors propose a unified framework that separates two choices that ad hoc procedures tend to conflate: the definition of the unit of analysis and the choice of regions of interest over which predictions are evaluated. By treating tokenization as an implementation detail rather than a scientific primitive, the framework supports reasoning about surprisal over arbitrary unit inventories and makes the link between psycholinguistic experiments and computational models explicit.

📝 Abstract

Surprisal theory links human processing effort to the predictability of an upcoming linguistic unit, but empirical work often leaves the notion of a unit underspecified. In practice, experimental stimuli are segmented into linguistically motivated units (e.g., words), while pretrained language models assign probability mass to a fixed token alphabet that typically does not align with those units. As a result, surprisal-based predictors depend implicitly on ad hoc procedures that conflate two distinct modeling choices: the definition of the unit of analysis and the choice of regions of interest over which predictions are evaluated. In this paper, we disentangle these choices and give a unified framework for reasoning about surprisal over arbitrary unit inventories. We argue that surprisal-based analyses should make these choices explicit and treat tokenization as an implementation detail rather than a scientific primitive.
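
To make the mismatch between tokens and units concrete, the following is a minimal sketch of the common practice of aggregating token-level surprisal into word-level surprisal by summation. It is not the authors' framework. The token strings, probabilities, and the token-to-word alignment in `spans` are hypothetical values chosen for illustration, and the function names are assumptions. Summation follows from the chain rule only when each unit is exactly a contiguous run of tokens, which is the kind of implicit choice the abstract says should be made explicit.

```python
import math
from typing import List, Sequence, Tuple


def token_surprisals(log_probs: Sequence[float]) -> List[float]:
    """Convert natural-log conditional probabilities into surprisal in bits."""
    return [-lp / math.log(2) for lp in log_probs]


def unit_surprisals(
    token_bits: Sequence[float],
    spans: Sequence[Tuple[int, int]],
) -> List[float]:
    """Sum token surprisals over each half-open token span [start, end).

    Assumes every unit is exactly a contiguous run of tokens, so the
    chain rule lets the unit's surprisal be written as a sum.
    """
    return [sum(token_bits[start:end]) for start, end in spans]


# Hypothetical tokenization of "The unbelievable story" (made-up values).
tokens = ["The", " un", "believ", "able", " story"]
probs = [0.5, 0.25, 0.5, 0.5, 0.125]  # p(token | preceding tokens), invented
log_probs = [math.log(p) for p in probs]

# Hypothetical alignment: word k covers tokens[start:end].
spans = [(0, 1), (1, 4), (4, 5)]

word_bits = unit_surprisals(token_surprisals(log_probs), spans)
for (start, end), bits in zip(spans, word_bits):
    print("".join(tokens[start:end]).strip(), round(bits, 3))
```

With these invented numbers the three words come out at roughly 1, 4, and 3 bits. Changing `spans`, for example by attaching a leading-space token to the preceding word instead of the following one, changes the per-word values without changing the model, which shows how much the predictor depends on the alignment procedure.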

Problem

Research questions and friction points this paper is trying to address.

surprisal theory

linguistic units

tokenization

predictability

language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

surprisal theory

tokenization

linguistic units