AI Summary
This work addresses the systematic underestimation inherent in few-sample discrete entropy estimation, which arises from neglecting the "missing mass," the total probability of unobserved symbols. To mitigate this bias, the authors propose SENECA, an entropy estimation scheme that estimates the missing mass through a self-consistent calculation. Numerical experiments indicate that SENECA outperforms many state-of-the-art alternatives overall in the small-sample setting. In two practical applications, biodiversity estimation and the detection of incorrect large language model responses, it is competitive with domain-specific approaches.
Abstract
Discrete entropy estimation is a classic information theory problem, wherein the average information content of a discrete random variable is estimated from samples alone. Naive approaches, such as the plugin method, fail to account for the probability mass associated with members of the random variable's support that are unobserved in a given sample, known as the "missing mass." The resulting systemic underestimation is particularly problematic when data is time-consuming or costly to gather. We propose SENECA, an entropy estimation scheme based on a novel "self-consistent" missing mass calculation. Extensive numerical experiments indicate that our approach outperforms many state-of-the-art alternatives overall in the small-sample setting. We then apply SENECA to two practical use cases, namely biodiversity estimation and the detection of incorrect large language model responses, where our method is competitive with domain-specific approaches. Our work advances SENECA as an effective drop-in replacement for small-sample entropy estimation, with broad utility across several domains.
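To make the underestimation concrete, the following minimal sketch computes the plugin estimate on a small sample and pairs it with the classical Good-Turing estimate of the missing mass. This is not SENECA: the abstract does not specify the self-consistent calculation, so Good-Turing is swapped in here purely as a well-known stand-in. The function names and the toy sample are assumptions made for illustration.

```python
import math
from collections import Counter


def plugin_entropy(samples):
    """Plugin (maximum-likelihood) entropy estimate, in nats.

    Uses empirical frequencies as probabilities, so unobserved
    symbols contribute nothing to the estimate.
    """
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def good_turing_missing_mass(samples):
    """Good-Turing missing-mass estimate: (number of singletons) / n.

    Illustrative stand-in only; this is NOT the self-consistent
    calculation used by SENECA.
    """
    counts = Counter(samples)
    n = len(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


# Hypothetical toy sample: six draws, assumed to come from a uniform
# distribution over eight symbols (true entropy = log 8 nats).
sample = ["a", "b", "a", "c", "d", "b"]

h_plugin = plugin_entropy(sample)
m_hat = good_turing_missing_mass(sample)

# The plugin estimate can never exceed the log of the number of
# distinct observed symbols (4 here), so it sits below log 8.
print(f"plugin entropy:         {h_plugin:.4f} nats")
print(f"upper bound log(4):     {math.log(4):.4f} nats")
print(f"assumed true entropy:   {math.log(8):.4f} nats")
print(f"Good-Turing missing mass: {m_hat:.4f}")
```

Because only four of the eight assumed symbols appear in the sample, the plugin estimate is capped at log 4 regardless of the true distribution, which is the bias that a missing-mass correction aims to reduce.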