🤖 AI Summary
This work addresses the lack of theoretical justification for heuristic practices such as batch centering and knowledge distillation in existing self-supervised clustering methods. It formulates self-supervised learning as a KL divergence minimization problem and prevents mode collapse by imposing a constraint on the teacher distribution, which leads to normalization by inverse cluster priors. Using Jensen’s inequality, this normalization simplifies to batch centering, giving both batch centering and knowledge distillation an information-theoretic foundation. This perspective helps explain the empirical effectiveness of existing successful methods and suggests directions for the design of future self-supervised clustering algorithms.
📝 Abstract
Self-supervised learning (SSL) is recognized as an essential tool for building foundation models for Artificial Intelligence applications. Advances in SSL have come from vigorous debate about its principles and from extensive empirical research. The aim of this paper is to contribute to the underpinning theory of SSL, focusing on the deep clustering approach. By analogy to supervised learning, we formulate SSL as KL divergence optimization.
Mode collapse is prevented by imposing an optimization constraint on the teacher distribution. This leads to normalization using inverse cluster priors. We show, using Jensen's inequality, that this normalization simplifies to the popular batch centering procedure. Distillation and centering are common heuristics-based practices in SSL, but our work underpins them theoretically. The theoretical model developed not only supports specific existing successful SSL methods, but also suggests directions for future investigations.
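
The relation between inverse-prior normalization and batch centering can be illustrated with a small sketch. This is one possible reading of the abstract, not the paper's own derivation or notation: it assumes the cluster prior is estimated as the batch mean of the teacher's softmax outputs, and every function name below is hypothetical. Dividing teacher probabilities by the prior amounts to subtracting `log(prior)` from the logits. Jensen's inequality gives `log(mean(p)) >= mean(log(p))`, and replacing `log(prior)` by that lower bound leaves the per-cluster mean logit plus a constant that the softmax cancels, which is batch centering.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cluster_prior(batch_logits):
    """Assumed prior estimate: batch mean of teacher probabilities per cluster."""
    probs = [softmax(row) for row in batch_logits]
    n, k = len(probs), len(probs[0])
    return [sum(p[j] for p in probs) / n for j in range(k)]


def mean_log_prob(batch_logits):
    """Jensen lower bound on log(prior): batch mean of log-probabilities."""
    probs = [softmax(row) for row in batch_logits]
    n, k = len(probs), len(probs[0])
    return [sum(math.log(p[j]) for p in probs) / n for j in range(k)]


def inverse_prior_teacher(batch_logits):
    """Divide by the cluster prior and renormalize (done in logit space)."""
    log_prior = [math.log(pj) for pj in cluster_prior(batch_logits)]
    return [softmax([z - lp for z, lp in zip(row, log_prior)])
            for row in batch_logits]


def jensen_bound_teacher(batch_logits):
    """Same as above, with log(prior) replaced by its Jensen lower bound."""
    bound = mean_log_prob(batch_logits)
    return [softmax([z - b for z, b in zip(row, bound)])
            for row in batch_logits]


def batch_centered_teacher(batch_logits):
    """Batch centering: subtract the per-cluster mean logit before the softmax."""
    n, k = len(batch_logits), len(batch_logits[0])
    center = [sum(row[j] for row in batch_logits) / n for j in range(k)]
    return [softmax([z - c for z, c in zip(row, center)])
            for row in batch_logits]
```

On any batch of logits, `jensen_bound_teacher` and `batch_centered_teacher` return the same distributions up to floating-point error, while `inverse_prior_teacher` generally differs from both because the Jensen bound is not tight.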