🤖 AI Summary
This work investigates the intrinsic mechanisms underlying cross-modal representation alignment under the Sigmoid contrastive loss. Motivated by SigLIP’s strong retrieval performance despite unclear origins of the modality gap, we introduce a novel geometric construct, the (m, b_rel)-Constellation, which formally characterizes, for the first time, the synergistic role of trainable inverse temperature and relative bias in driving loss convergence to zero. Leveraging spherical code theory, we rigorously derive the minimal embedding dimension required for high-quality representations and reveal that the modality gap stems from a non-uniform distribution of features on the unit hypersphere. We further propose a loss reparameterization incorporating an explicit relative bias term. Experiments demonstrate that this formulation significantly improves training dynamics on synthetic data, accelerating convergence and enhancing representation quality.
📝 Abstract
The meta-task of obtaining and aligning representations through contrastive pretraining is steadily gaining importance since its introduction in CLIP and ALIGN. In this paper we theoretically explain the advantages of synchronizing with trainable inverse temperature and bias under the sigmoid loss, as implemented in the recent SigLIP and SigLIP2 models of Google DeepMind. Temperature and bias can drive the loss function to zero for a rich class of configurations that we call $(\mathsf{m}, \mathsf{b}_{\mathsf{rel}})$-Constellations. $(\mathsf{m}, \mathsf{b}_{\mathsf{rel}})$-Constellations are a novel combinatorial object related to spherical codes and are parametrized by a margin $\mathsf{m}$ and relative bias $\mathsf{b}_{\mathsf{rel}}$. We use our characterization of constellations to theoretically justify the success of SigLIP on retrieval, to explain the modality gap present in SigLIP, and to identify the necessary dimension for producing high-quality representations. Finally, we propose a reparameterization of the sigmoid loss with explicit relative bias, which improves training dynamics in experiments with synthetic data.
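To make the role of temperature and relative bias concrete, here is a minimal sketch in plain Python. It rests on assumptions that the abstract does not spell out: the logits are taken as t * (<x_i, y_j> - b_rel), which is one plausible form of the relative-bias reparameterization, and the margin is read as the largest m such that matching pairs have inner product at least b_rel + m and non-matching pairs at most b_rel - m. The function names `sigmoid_loss` and `margin` are illustrative and do not come from the paper or from any SigLIP codebase.

```python
import math


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) for any real z.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def inner_products(xs, ys):
    # Matrix of pairwise inner products <x_i, y_j>.
    return [[sum(a * b for a, b in zip(x, y)) for y in ys] for x in xs]


def sigmoid_loss(xs, ys, t, b_rel):
    # Sigmoid contrastive loss with inverse temperature t and
    # relative bias b_rel (assumed form: logit = t * (<x_i, y_j> - b_rel)).
    sims = inner_products(xs, ys)
    n = len(xs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0  # +1 for matching pairs, -1 otherwise
            total -= log_sigmoid(label * t * (sims[i][j] - b_rel))
    return total / n


def margin(xs, ys, b_rel):
    # Largest m with <x_i, y_i> >= b_rel + m and <x_i, y_j> <= b_rel - m for i != j
    # (assumed reading of the margin; requires at least two pairs).
    sims = inner_products(xs, ys)
    n = len(xs)
    pos = min(sims[i][i] for i in range(n))
    neg = max(sims[i][j] for i in range(n) for j in range(n) if i != j)
    return min(pos - b_rel, b_rel - neg)


# Toy configuration: both modalities use the standard basis of R^3.
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ys = xs
print("margin:", margin(xs, ys, b_rel=0.5))
for t in (1.0, 10.0, 100.0):
    print("t =", t, "loss =", sigmoid_loss(xs, ys, t, b_rel=0.5))
```

Under these assumptions, any configuration with a positive margin sees every term of the loss shrink as t grows, which is consistent with the abstract's statement that temperature and bias can drive the loss to zero. The toy example uses identical embeddings for both modalities only for simplicity; it is not meant to illustrate the modality gap.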