AI Summary
Prior work commonly approximates word-level contextual entropy with the entropy over a word's first subword token, but this approach systematically underestimates true word entropy and distorts its validity as a psycholinguistic predictor. Method: We propose a variable-length subword entropy estimator based on Monte Carlo sampling that models the full-word probability distribution conditioned on context. Contribution/Results: Regression analyses with eye-tracking reading-time data reveal a significant divergence in predictive power between first-subword entropy and our Monte Carlo-estimated word entropy: the latter exhibits superior explanatory power and captures entropy variation missed by the former. This study is the first to systematically identify the theoretical limitations of the first-subword approximation, establishing a more rigorous and scalable methodological foundation for constructing entropy metrics from language models in psycholinguistics.
Abstract
Contextual entropy is a psycholinguistic measure capturing the anticipated difficulty of processing a word just before it is encountered. Recent studies have tested for entropy-related effects as a potential complement to well-known effects from surprisal. For convenience, entropy is typically estimated based on a language model's probability distribution over a word's first subword token. However, this approximation results in underestimation and potential distortion of true word entropy. To address this, we generate Monte Carlo (MC) estimates of word entropy that allow words to span a variable number of tokens. Regression experiments on reading times show divergent results between first-token and MC word entropy, suggesting a need for caution in using first-token approximations of contextual entropy.
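To make the distinction concrete, here is a minimal, self-contained sketch of the two quantities, not the authors' implementation. The toy vocabulary `VOCAB`, the hand-crafted `next_token_probs` distribution, the SentencePiece-style convention that tokens beginning with `_` start a new word, and the sample size are all illustrative assumptions; the sketch only shows how a Monte Carlo estimate of word entropy over variable-length token sequences differs from the entropy of the first-subword distribution.

```python
import numpy as np

# Toy subword LM over a tiny vocabulary. Tokens beginning with "_" start a new
# word (SentencePiece-style); other tokens continue the current word. This is a
# stand-in for a real autoregressive LM, purely for illustration.
VOCAB = ["_the", "_do", "g", "gs", "_cat", "s"]

def next_token_probs(prefix_tokens):
    """Return a probability vector over VOCAB given the tokens of the current
    word generated so far (hypothetical hand-crafted distribution)."""
    if not prefix_tokens:                       # first subword of the next word
        return np.array([0.05, 0.45, 0.0, 0.0, 0.50, 0.0])
    last = prefix_tokens[-1]
    if last == "_do":                           # "do" may continue as "dog"/"dogs"
        return np.array([0.2, 0.0, 0.3, 0.3, 0.2, 0.0])
    if last == "_cat":                          # "cat" may continue as "cats"
        return np.array([0.3, 0.2, 0.0, 0.0, 0.2, 0.3])
    # other word-internal states: the next token always starts a new word
    return np.array([0.4, 0.3, 0.0, 0.0, 0.3, 0.0])

def first_token_entropy():
    """Entropy of the distribution over the word's first subword only."""
    p = next_token_probs([])
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def sample_word(rng):
    """Sample subword tokens until the next word boundary; return the sampled
    word's tokens and its total log2-probability under the toy LM."""
    tokens, logp = [], 0.0
    while True:
        p = next_token_probs(tokens)
        i = rng.choice(len(VOCAB), p=p)
        tok = VOCAB[i]
        if tokens and tok.startswith("_"):
            # The word ends here. Its probability includes the mass of *any*
            # boundary token: p(word) = p(tokens) * p(next token starts a word).
            boundary_mass = sum(p[j] for j, t in enumerate(VOCAB) if t.startswith("_"))
            return tokens, logp + np.log2(boundary_mass)
        tokens.append(tok)
        logp += np.log2(p[i])

def mc_word_entropy(n_samples=20000, seed=0):
    """Monte Carlo estimate: H(word | context) ~= -mean(log2 p(sampled word))."""
    rng = np.random.default_rng(seed)
    logps = [sample_word(rng)[1] for _ in range(n_samples)]
    return -float(np.mean(logps))

if __name__ == "__main__":
    print(f"first-token entropy : {first_token_entropy():.3f} bits")
    print(f"MC word entropy     : {mc_word_entropy():.3f} bits")
```

In this toy setting the MC estimate comes out higher than the first-token entropy: since the first subword is a deterministic function of the word, word entropy can never be lower than first-token entropy, which mirrors the underestimation the abstract describes.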