SENECA: Small-Sample Discrete Entropy Estimation via Self-Consistent Missing Mass

📅 2026-05-01

🤖 AI Summary
This work addresses the systematic underestimation inherent in small-sample discrete entropy estimation, which arises when an estimator neglects the "missing mass," the total probability of symbols that never appear in the sample. To reduce this bias, the authors propose SENECA, an entropy estimator built on a self-consistent calculation of the missing mass. Numerical experiments indicate that SENECA outperforms many state-of-the-art alternatives overall in the small-sample setting. In two practical applications, biodiversity estimation and the detection of incorrect large language model responses, it is competitive with domain-specific approaches. The authors position SENECA as a drop-in replacement for small-sample entropy estimation with broad utility across domains.
📝 Abstract
Discrete entropy estimation is a classic information theory problem, wherein the average information content of a discrete random variable is estimated from samples alone. Naive approaches, such as the plugin method, fail to account for the probability mass associated with members of the random variable's support that are unobserved in a given sample, known as the "missing mass." The resulting systemic underestimation is particularly problematic when data is time-consuming or costly to gather. We propose SENECA, an entropy estimation scheme based on a novel "self-consistent" missing mass calculation. Extensive numerical experiments indicate that our approach outperforms many state-of-the-art alternatives overall in the small-sample setting. We then apply SENECA to two practical use cases, namely biodiversity estimation and the detection of incorrect large language model responses, where our method is competitive with domain-specific approaches. Our work advances SENECA as an effective drop-in replacement for small-sample entropy estimation, with broad utility across several domains.
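To make the plugin method's bias concrete, the following minimal sketch computes the plugin entropy of a small sample together with a missing mass estimate. It is an illustration under stated assumptions, not the SENECA procedure: the abstract does not specify the self-consistent calculation, so the classical Good-Turing estimator (the fraction of samples that are singletons) is swapped in here. The function names and the toy sample are hypothetical.

```python
import math
from collections import Counter


def plugin_entropy(samples):
    """Plugin (maximum-likelihood) entropy in nats: -sum(p_hat * log(p_hat))."""
    if not samples:
        raise ValueError("samples must be non-empty")
    n = len(samples)
    counts = Counter(samples)
    # Only observed symbols contribute, so unobserved symbols are ignored entirely.
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def good_turing_missing_mass(samples):
    """Good-Turing missing mass estimate (an assumption, NOT SENECA's method):
    the number of symbols seen exactly once, divided by the sample size."""
    if not samples:
        raise ValueError("samples must be non-empty")
    n = len(samples)
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


# Hypothetical toy sample: six draws that reveal only four distinct symbols.
sample = ["a", "b", "a", "c", "d", "b"]
observed = len(set(sample))

h_plugin = plugin_entropy(sample)
m_missing = good_turing_missing_mass(sample)

print(f"plugin entropy      : {h_plugin:.4f} nats")
print(f"upper bound log(k)  : {math.log(observed):.4f} nats (k = observed symbols)")
print(f"missing mass (G-T)  : {m_missing:.4f}")
```

The plugin estimate can never exceed the logarithm of the number of distinct observed symbols, so if the true support is larger than what the sample reveals, the estimate is capped below the true entropy of a near-uniform source. A nonzero missing mass estimate signals that such unobserved symbols likely carry probability that the plugin method leaves out.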
Problem

Research questions and friction points this paper is trying to address.

discrete entropy estimation
small-sample
missing mass
information theory
systematic underestimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

entropy estimation
missing mass
small-sample
self-consistent
discrete distribution