Global Minimizers of Sigmoid Contrastive Loss

📅 2025-09-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work studies how the sigmoid contrastive loss aligns cross-modal representations when the inverse temperature and bias are trainable, as in SigLIP and SigLIP2. It introduces a geometric construct, the (m, b_rel)-Constellation, parametrized by a margin m and a relative bias b_rel, and shows that trainable temperature and bias can drive the loss to zero on this class of configurations. Drawing on the connection to spherical codes, the authors use this characterization to justify SigLIP's retrieval performance, explain its modality gap, and identify the embedding dimension needed for high-quality representations. They also propose a reparameterization of the loss with an explicit relative bias term, which improves training dynamics in experiments on synthetic data.

📝 Abstract

The meta-task of obtaining and aligning representations through contrastive pretraining is steadily gaining importance since its introduction in CLIP and ALIGN. In this paper we theoretically explain the advantages of synchronizing with trainable inverse temperature and bias under the sigmoid loss, as implemented in the recent SigLIP and SigLIP2 models of Google DeepMind. Temperature and bias can drive the loss function to zero for a rich class of configurations that we call $(\mathsf{m}, \mathsf{b}_{\mathsf{rel}})$-Constellations. $(\mathsf{m}, \mathsf{b}_{\mathsf{rel}})$-Constellations are a novel combinatorial object related to spherical codes and are parametrized by a margin $\mathsf{m}$ and relative bias $\mathsf{b}_{\mathsf{rel}}$. We use our characterization of constellations to theoretically justify the success of SigLIP on retrieval, to explain the modality gap present in SigLIP, and to identify the necessary dimension for producing high-quality representations. Finally, we propose a reparameterization of the sigmoid loss with explicit relative bias, which improves training dynamics in experiments with synthetic data.
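To make the claim that temperature and bias can drive the loss to zero concrete, here is a minimal NumPy sketch of the pairwise sigmoid loss in the form used by SigLIP: the sum over all pairs of log(1 + exp(-z_ij (t <x_i, y_j> + b))), divided by the batch size, with z_ij = +1 for matched pairs and -1 otherwise. The function name, the toy configuration (both modalities embedded as the same orthonormal basis), and the choice b = -t * b_rel with b_rel = 0.5 are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sigmoid_contrastive_loss(X, Y, t, b):
    """Sigmoid loss for N matched pairs (X[i], Y[i]) of unit-norm embeddings.

    t is the inverse temperature and b the bias (both trainable in SigLIP).
    """
    n = X.shape[0]
    logits = t * (X @ Y.T) + b       # pairwise logits, shape (N, N)
    z = 2.0 * np.eye(n) - 1.0        # +1 on matched pairs, -1 elsewhere
    # -log(sigmoid(u)) == log(1 + exp(-u)); logaddexp keeps this stable
    return np.logaddexp(0.0, -z * logits).sum() / n

# Toy configuration (assumption): matched similarity 1, unmatched similarity 0.
X = Y = np.eye(4)
b_rel = 0.5  # hypothetical threshold halfway between the two similarity levels

# Scale the inverse temperature up while keeping b = -t * b_rel.
losses = [sigmoid_contrastive_loss(X, Y, t, -t * b_rel) for t in (1.0, 10.0, 40.0)]
print(losses)
```

In this toy case every pair ends up on the correct side of the threshold by the same amount, so each term equals log(1 + exp(-t/2)) and the loss shrinks toward zero as t grows.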

Problem

Research questions and friction points this paper is trying to address.

Theoretical analysis of sigmoid contrastive loss with trainable temperature and bias

Characterizing optimal configurations called (m, b_rel)-Constellations for representation learning

Explaining SigLIP model success on retrieval tasks and modality gap phenomena
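The summary above does not spell out the defining inequalities of a constellation. The sketch below assumes one natural reading of "margin m and relative bias b_rel": every matched pair has inner product at least b_rel + m, and every unmatched pair has inner product at most b_rel - m. Under that assumption, the helper (a hypothetical name) returns the largest margin a configuration achieves for a given relative bias.

```python
import numpy as np

def constellation_margin(X, Y, b_rel):
    """Largest m with <x_i, y_i> >= b_rel + m and <x_i, y_j> <= b_rel - m (i != j).

    Assumed definition for illustration; a non-positive result means the
    configuration is not a constellation for this b_rel. Requires N >= 2.
    """
    sims = X @ Y.T
    off_diag = ~np.eye(sims.shape[0], dtype=bool)
    matched_min = np.diag(sims).min()      # weakest matched pair
    unmatched_max = sims[off_diag].max()   # strongest unmatched pair
    return min(matched_min - b_rel, b_rel - unmatched_max)

# Toy check: matched similarity 1, unmatched similarity 0.
X = Y = np.eye(3)
margin_centered = constellation_margin(X, Y, b_rel=0.5)  # 0.5
margin_shifted = constellation_margin(X, Y, b_rel=0.2)   # 0.2
```

A positive margin under this reading also means each x_i is strictly closer to its own y_i than to any other y_j, which is the property nearest-neighbor retrieval relies on.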

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synchronizing temperature and bias under sigmoid loss

Introducing novel combinatorial objects called Constellations

Reparameterizing sigmoid loss with explicit relative bias
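The exact form of the reparameterized loss is not given on this page. One plausible version, assumed here, replaces the bias b by a relative bias b_rel = -b / t, so that the logits become t * (<x_i, y_j> - b_rel). The sketch below (function names are hypothetical) checks that the two parameterizations give the same loss value at corresponding parameters.

```python
import numpy as np

def pairwise_sigmoid_loss(logits):
    """Sum of -log(sigmoid(z * logit)) over all pairs, divided by batch size."""
    n = logits.shape[0]
    z = 2.0 * np.eye(n) - 1.0  # +1 on matched pairs, -1 elsewhere
    return np.logaddexp(0.0, -z * logits).sum() / n

def loss_standard(X, Y, t, b):
    # Standard form: inverse temperature t and additive bias b
    return pairwise_sigmoid_loss(t * (X @ Y.T) + b)

def loss_relative_bias(X, Y, t, b_rel):
    # Assumed reparameterization: b_rel acts as a similarity threshold
    return pairwise_sigmoid_loss(t * (X @ Y.T - b_rel))

# Random unit-norm embeddings for a quick consistency check
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Y = rng.normal(size=(5, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)
Y /= np.linalg.norm(Y, axis=1, keepdims=True)

t, b_rel = 10.0, 0.3
value_standard = loss_standard(X, Y, t, b=-t * b_rel)
value_relative = loss_relative_bias(X, Y, t, b_rel)
```

The loss values coincide, but the gradients do not: in the assumed relative-bias form t multiplies the thresholded similarity, so changing t no longer shifts the decision threshold. That decoupling is one way such a reparameterization could affect training dynamics.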
