Beyond I-Con: Exploring New Dimensions of Distance Measures in Representation Learning

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies inherent limitations of the Kullback–Leibler (KL) divergence in representation learning—including asymmetry, unboundedness, and misalignment with downstream objectives. To address these issues, we propose the Beyond I-Con framework, the first systematic approach to co-designing statistical divergences (e.g., total variation distance, bounded f-divergences) and similarity kernel functions (e.g., distance-based kernels) for loss construction. The framework decouples divergence selection from kernel design, enabling task-adaptive representation learning. Experiments on DINO-ViT clustering, supervised contrastive learning, and nonlinear dimensionality reduction demonstrate that our method consistently outperforms KL divergence–based baselines paired with angular kernels. Notably, improvements extend to downstream classification and retrieval performance, validating both the effectiveness and generalizability of divergence–kernel co-design.

📝 Abstract
The Information Contrastive (I-Con) framework revealed that over 23 representation learning methods implicitly minimize KL divergence between data and learned distributions that encode similarities between data points. However, a KL-based loss may be misaligned with the true objective, and properties of KL divergence such as asymmetry and unboundedness may create optimization challenges. We present Beyond I-Con, a framework that enables systematic discovery of novel loss functions by exploring alternative statistical divergences and similarity kernels. Key findings: (1) on unsupervised clustering of DINO-ViT embeddings, we achieve state-of-the-art results by modifying the PMI algorithm to use total variation (TV) distance; (2) on supervised contrastive learning, we outperform the standard approach by using TV and a distance-based similarity kernel instead of KL and an angular kernel; (3) on dimensionality reduction, we achieve superior qualitative results and better performance on downstream tasks than SNE by replacing KL with a bounded f-divergence. Our results highlight the importance of considering divergence and similarity kernel choices in representation learning optimization.
Problem

Research questions and friction points this paper is trying to address.

Exploring alternative divergences beyond KL for representation learning
Addressing KL divergence asymmetry and unboundedness in optimization
Systematic discovery of novel loss functions using statistical divergences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modifies the PMI clustering algorithm to use total variation (TV) distance instead of KL
Pairs TV with a distance-based similarity kernel for supervised contrastive learning
Replaces KL with a bounded f-divergence in SNE-style dimensionality reduction
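The paper's central substitution — swapping KL divergence for the bounded, symmetric total variation distance between similarity distributions — can be sketched as follows. This is a toy illustration of the two divergences' contrasting properties, not the authors' implementation; the function names and example values are ours:

```python
import numpy as np

def softmax(x):
    """Turn raw similarity scores into a probability distribution."""
    e = np.exp(x - x.max())
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): asymmetric and unbounded; blows up where q ~ 0 but p > 0."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def tv_distance(p, q):
    """Total variation: symmetric and bounded in [0, 1]; 0.5 * L1 distance."""
    return float(0.5 * np.abs(p - q).sum())

# Target neighbor distribution p and two candidate learned distributions.
p = softmax(np.array([4.0, 1.0, 0.0]))
q_good = softmax(np.array([3.5, 1.2, 0.1]))   # roughly matches p
q_bad = softmax(np.array([0.0, 0.0, 8.0]))    # mass where p has almost none

# Both divergences rank the good fit above the bad one,
# but only TV stays bounded for arbitrarily bad fits.
assert kl_divergence(p, q_good) < kl_divergence(p, q_bad)
assert tv_distance(p, q_good) < tv_distance(p, q_bad)
assert tv_distance(p, q_bad) <= 1.0
```

The boundedness shown in the last assertion is one reason the abstract cites for moving beyond KL: a TV-based loss cannot be dominated by a handful of catastrophically mismatched pairs during optimization.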
Jasmine L. Shone
Department of Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139
Shaden Alshammari
Graduate student at MIT
Machine Learning · Computer Vision
Mark Hamilton
MIT, Microsoft
machine learning · computer vision · distributed systems · unsupervised learning
Zhening Li
Department of Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139
William Freeman
Department of Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139; Google