To BEE or not to BEE: Estimating more than Entropy with Biased Entropy Estimators

📅 2025-01-20

📈 Citations: 0

✨ Influential: 0

career value

239K/year

🤖 AI Summary

Accurate estimation of entropy, mutual information, and conditional mutual information in software engineering is often hindered by high computational cost and long runtime. This paper systematically evaluates 18 bias-corrected entropy estimators across varying sample sizes and domain cardinalities. Through large-scale simulations of random joint distributions—complemented by rigorous statistical bias analysis and quantitative convergence assessment—we identify, for the first time, that the Chao–Shen and Chao–Wang–Jost estimators consistently exhibit rapid convergence and strong robustness across all entropy measures. Crucially, they achieve superior accuracy and faster convergence under low-sample-size conditions. Our findings yield a lightweight, reliable, and plug-and-play entropy estimation framework, directly applicable to software confidentiality analysis, test adequacy assessment, and machine learning feature selection.

Technology Category

Application Category

📝 Abstract

Entropy estimation plays a significant role in biology, economics, physics, communication engineering and other disciplines. It is increasingly used in software engineering, e.g. in software confidentiality, software testing, predictive analysis, machine learning, and software improvement. However accurate estimation is demonstrably expensive in many contexts, including software. Statisticians have consequently developed biased estimators that aim to accurately estimate entropy on the basis of a sample. In this paper we apply 18 widely employed entropy estimators to Shannon measures useful to the software engineer: entropy, mutual information and conditional mutual information. Moreover, we investigate how the estimators are affected by two main influential factors: sample size and domain size. Our experiments range over a large set of randomly generated joint probability distributions and varying sample sizes, rather than choosing just one or two well known probability distributions as in previous investigations. Our most important result is identifying that the Chao-Shen and Chao-Wang-Jost estimators stand out for consistently converging more quickly to the ground truth, regardless of domain size and regardless of the measure used. They also tend to outperform the others in terms of accuracy as sample sizes increase. This discovery enables a significant reduction in data collection effort without compromising performance.

Problem

Research questions and friction points this paper is trying to address.

Software Engineering

Entropy Estimation

Information Theory

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chao-Shen and Chao-Wang-Jost estimators

Entropy estimation

Software engineering

🔎 Similar Papers

The Principle of Uncertain Maximum Entropy