🤖 AI Summary
This paper identifies inherent limitations of the Kullback–Leibler (KL) divergence in representation learning—including asymmetry, unboundedness, and misalignment with downstream objectives. To address these issues, we propose the Beyond I-Con framework, the first systematic approach to co-designing statistical divergences (e.g., total variation distance and other bounded f-divergences) and similarity kernels (e.g., distance-based kernels) for loss construction. The framework decouples divergence selection from kernel design, enabling task-adaptive representation learning. Experiments on unsupervised clustering of DINO-ViT embeddings, supervised contrastive learning, and nonlinear dimensionality reduction show that our method consistently outperforms KL-divergence-based baselines paired with angular kernels. Notably, the improvements carry over to downstream classification and retrieval, validating both the effectiveness and the generality of divergence–kernel co-design.
📝 Abstract
The Information Contrastive (I-Con) framework revealed that over 23 representation learning methods implicitly minimize KL divergence between data and learned distributions that encode similarities between data points. However, a KL-based loss may be misaligned with the true objective, and properties of KL divergence such as asymmetry and unboundedness may create optimization challenges. We present Beyond I-Con, a framework that enables systematic discovery of novel loss functions by exploring alternative statistical divergences and similarity kernels. Key findings: (1) on unsupervised clustering of DINO-ViT embeddings, we achieve state-of-the-art results by modifying the PMI algorithm to use total variation (TV) distance; (2) on supervised contrastive learning, we outperform the standard approach by using TV and a distance-based similarity kernel instead of KL and an angular kernel; (3) on dimensionality reduction, we achieve superior qualitative results and better performance on downstream tasks than SNE by replacing KL with a bounded f-divergence. Our results highlight the importance of considering divergence and similarity kernel choices in representation learning optimization.
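The optimization issues the abstract attributes to KL divergence are easy to see numerically. The sketch below (illustrative only, not the paper's loss construction) compares KL divergence and total variation (TV) distance on toy discrete "neighbor" distributions: KL is asymmetric and blows up when the learned distribution assigns near-zero mass where the target has mass, while TV is symmetric and bounded by 1.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions.

    Asymmetric, and unbounded as any q_i -> 0 while p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv_distance(p, q):
    """Total variation distance: symmetric and bounded in [0, 1]."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Toy target and learned distributions over 3 neighbors
p = [0.8, 0.15, 0.05]
q = [0.3, 0.4, 0.3]

print(kl_divergence(p, q))  # != kl_divergence(q, p): asymmetric
print(kl_divergence(q, p))
print(tv_distance(p, q))    # == tv_distance(q, p), and always <= 1

# When q nearly misses the mass of p, KL explodes but TV stays bounded
q_degenerate = [1e-6, 0.499999, 0.5]
print(kl_divergence(p, q_degenerate))  # large
print(tv_distance(p, q_degenerate))    # still <= 1
```

A bounded, symmetric divergence yields gradients that do not explode on hard (near-disjoint) pairs, which is one intuition behind swapping KL for TV or another bounded f-divergence in the losses above.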