🤖 AI Summary
This work challenges the common practice in contrastive learning of using cosine similarity, which implicitly treats embedding norms as noise and discards any semantic information they carry. Through a systematic 2×2 ablation that independently controls input-side and output-side normalization in both text and vision encoders, the authors investigate the functional role of embedding norms. They propose a task symmetry principle: preserving norm information significantly improves performance on asymmetric tasks such as text retrieval but harms it on symmetric tasks. They further reveal a functional asymmetry between input and output norms. Combining controlled normalization ablations with Cohen's d effect-size analysis, the study shows that simply removing the redundant unit-hypersphere constraint at inference yields zero-cost performance gains on dense text retrieval benchmarks.
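The "zero-cost" change described above amounts to scoring with the raw dot product instead of cosine similarity at inference. A minimal sketch (not the authors' code; the vectors are made up for illustration) of how the two scores differ when documents share a direction but not a magnitude:

```python
import numpy as np

def cosine_scores(query, docs):
    # Cosine: normalize both sides, so magnitude is discarded as "noise".
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return d @ q

def dot_scores(query, docs):
    # Dot product: document magnitude is kept and scales the score.
    return docs @ query

query = np.array([1.0, 2.0, 0.5])
docs = np.array([[2.0, 4.0, 1.0],    # same direction as query, larger norm
                 [1.0, 2.0, 0.5]])   # same direction, smaller norm

print(cosine_scores(query, docs))  # both documents tie: direction only
print(dot_scores(query, docs))     # the larger-norm document ranks higher
```

Because both scoring rules use the same embeddings, switching from cosine to dot product changes only the final similarity computation, hence the "zero-cost" framing.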
📄 Abstract
Cosine similarity is prevalent in contrastive learning, yet it makes an implicit assumption: embedding magnitude is noise. Prior work occasionally found dot product and cosine similarity comparable, but left unanswered WHAT information magnitude carries, WHEN it helps, and HOW to leverage it. We conduct a systematic study through a $2 \times 2$ ablation that independently controls input-side and output-side normalization across text and vision models. Our findings reveal three key insights. First, in text retrieval, output (document) magnitude strongly correlates with relevance (Cohen's $d$ up to 1.80), yielding the largest gains on reasoning-intensive tasks. Second, input and output magnitudes serve asymmetric roles: output magnitude directly scales similarity scores while input magnitude modulates training dynamics. Third, magnitude learning benefits asymmetric tasks (text retrieval, RAG) but harms symmetric tasks (STS, text-image alignment). These findings establish a task symmetry principle: the choice between cosine and dot product depends on whether the task has distinct input roles, enabling cost-free improvements by simply removing an unnecessary constraint.
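The abstract reports the norm-relevance correlation as a Cohen's $d$ effect size (up to 1.80). A hedged sketch of that measurement, using the standard pooled-standard-deviation formula and synthetic norm values in place of real embedding data:

```python
import numpy as np

def cohens_d(a, b):
    # Effect size between two samples: mean difference over pooled SD
    # (Bessel-corrected variances, standard pooled formula).
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * np.var(a, ddof=1) +
                         (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled_sd

# Illustrative stand-ins for document embedding norms, NOT real results:
# relevant documents drawn with a higher mean norm than irrelevant ones.
rng = np.random.default_rng(0)
relevant_norms = rng.normal(loc=10.0, scale=1.0, size=500)
irrelevant_norms = rng.normal(loc=8.2, scale=1.0, size=500)

print(f"Cohen's d = {cohens_d(relevant_norms, irrelevant_norms):.2f}")
```

A $d$ near 1.8 means the two norm distributions are separated by almost two pooled standard deviations, i.e., document magnitude alone is a strong relevance signal.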