To BEE or not to BEE: Estimating more than Entropy with Biased Entropy Estimators

📅 2025-01-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Accurate estimation of entropy, mutual information, and conditional mutual information in software engineering is often hindered by high computational cost and long runtime. This paper systematically evaluates 18 bias-corrected entropy estimators across varying sample sizes and domain cardinalities. Through large-scale simulations of random joint distributions—complemented by rigorous statistical bias analysis and quantitative convergence assessment—we identify, for the first time, that the Chao–Shen and Chao–Wang–Jost estimators consistently exhibit rapid convergence and strong robustness across all entropy measures. Crucially, they achieve superior accuracy and faster convergence under low-sample-size conditions. Our findings yield a lightweight, reliable, and plug-and-play entropy estimation framework, directly applicable to software confidentiality analysis, test adequacy assessment, and machine learning feature selection.

Technology Category

Application Category

📝 Abstract
Entropy estimation plays a significant role in biology, economics, physics, communication engineering and other disciplines. It is increasingly used in software engineering, e.g. in software confidentiality, software testing, predictive analysis, machine learning, and software improvement. However accurate estimation is demonstrably expensive in many contexts, including software. Statisticians have consequently developed biased estimators that aim to accurately estimate entropy on the basis of a sample. In this paper we apply 18 widely employed entropy estimators to Shannon measures useful to the software engineer: entropy, mutual information and conditional mutual information. Moreover, we investigate how the estimators are affected by two main influential factors: sample size and domain size. Our experiments range over a large set of randomly generated joint probability distributions and varying sample sizes, rather than choosing just one or two well known probability distributions as in previous investigations. Our most important result is identifying that the Chao-Shen and Chao-Wang-Jost estimators stand out for consistently converging more quickly to the ground truth, regardless of domain size and regardless of the measure used. They also tend to outperform the others in terms of accuracy as sample sizes increase. This discovery enables a significant reduction in data collection effort without compromising performance.
Problem

Research questions and friction points this paper is trying to address.

Software Engineering
Entropy Estimation
Information Theory
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chao-Shen and Chao-Wang-Jost estimators
Entropy estimation
Software engineering
🔎 Similar Papers
No similar papers found.
I
Ilaria Pia la Torre
University College London, UK
David A. Kelly
David A. Kelly
King's College London
Information TheoryCausalityExplainable AISoftware Engineering
H
Hector D. Menendez
King’s College London, UK
D
David Clark
University College London, UK