🤖 AI Summary
This study addresses the lack of a systematic information-theoretic understanding of focal loss, particularly its effect on predicted probability distributions under class imbalance. Taking a distributional perspective, the work introduces focal entropy as a novel theoretical framework and, using tools from information theory, convex analysis, and asymptotic analysis, rigorously establishes the existence and uniqueness of its minimizer. It further clarifies the asymmetric modulation that focal loss applies to samples with medium, high, and extremely low predicted probabilities. The analysis identifies the conditions under which focal entropy is finite, convex, and continuous, and critically shows that in highly imbalanced settings it tends to over-suppress extremely low probabilities. These findings provide a solid theoretical foundation for both interpreting and deploying focal loss in practice.
📝 Abstract
Focal loss has become a widely used alternative to cross-entropy in class-imbalanced classification, particularly in computer vision. Despite its empirical success, a systematic information-theoretic study of focal loss remains incomplete. In this work, we adopt a distributional viewpoint and study the focal entropy, a focal-loss analogue of cross-entropy. Our analysis establishes conditions for the finiteness, convexity, and continuity of the focal entropy and provides several asymptotic characterizations. We prove the existence and uniqueness of the focal-entropy minimizer, describe its structure, and show that it can depart significantly from the data distribution. In particular, we rigorously show that focal loss amplifies mid-range probabilities, suppresses high-probability outcomes, and, under extreme class imbalance, induces an over-suppression regime in which very small probabilities are further diminished. These results, which we also validate experimentally, offer a theoretical foundation for understanding focal loss and clarify the trade-offs it introduces in imbalanced learning tasks.
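The modulation described above stems from the standard focal loss of Lin et al. (2017), which rescales cross-entropy by the factor (1 − p)^γ. A minimal sketch (the choice γ = 2 is an illustrative assumption, not taken from this paper) shows how the weight focal loss places on a sample, relative to cross-entropy, shrinks as the predicted probability of the true class grows:

```python
import math

def cross_entropy(p: float) -> float:
    # Standard cross-entropy for true-class probability p.
    return -math.log(p)

def focal_loss(p: float, gamma: float = 2.0) -> float:
    # Focal loss: (1 - p)^gamma down-weights well-classified
    # samples (p close to 1) relative to cross-entropy.
    return -((1.0 - p) ** gamma) * math.log(p)

# Ratio FL/CE equals (1 - p)^gamma: near 1 for hard samples,
# near 0 for confident ones.
for p in (0.01, 0.5, 0.99):
    ratio = focal_loss(p) / cross_entropy(p)
    print(f"p = {p:<4}  FL/CE = {ratio:.4f}")
```

With γ = 2 the relative weight drops from about 0.98 at p = 0.01 to 0.25 at p = 0.5 and to 0.0001 at p = 0.99, which is the asymmetric suppression of high-probability outcomes that the abstract analyzes.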