🤖 AI Summary
This work addresses a previously overlooked issue in contrastively trained vision-language models (VLMs): their shared latent space is contaminated by substantial non-semantic, architecture-level noise that degrades representation quality. The study is the first to identify and characterize this phenomenon, and it proposes a method based on spectral decomposition of the covariance matrix that separates the latent space into a semantic signal subspace and a shared noise subspace. Because the noise geometry is invariant across data subgroups, the approach can use that invariance to identify and prune noise-corrupted dimensions. Experiments show that removing these dimensions preserves or even improves downstream task performance, indicating that the method separates semantic content from spurious correlations and improves the geometric structure of the multimodal representations.
📝 Abstract
Contrastively pre-trained Vision-Language Models (VLMs) serve as powerful feature extractors. Yet their shared latent spaces are prone to structural anomalies and act as repositories for non-semantic, multi-modal noise. To address this phenomenon, we apply spectral decomposition to covariance matrices, splitting the VLM latent space into a multi-modal semantic signal component and a shared noise subspace. We observe that this noise geometry exhibits strong subgroup invariance: it remains stable across distinct data subsets. Crucially, pruning these shared noise dimensions is largely harmless, preserving or even improving downstream task performance. By isolating true semantic signals from artifactual noise, this work provides new mechanistic insights into the representational structure of modern VLMs, suggesting that a substantial fraction of their latent geometry is governed by shared, architecture-level noise rather than by task-relevant semantics alone.
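The pipeline described in the abstract (covariance eigendecomposition, a subgroup-invariance check, then pruning) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The synthetic data, the function names, and above all the rule for picking candidate noise directions (here simply the largest-variance eigenvectors) are assumptions made for illustration, since the abstract does not state how the noise subspace is selected.

```python
import numpy as np

def covariance_eigenbasis(embeddings):
    """Eigendecompose the covariance of row-wise embeddings (eigenvalues descending)."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(embeddings) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

def subspace_overlap(basis_a, basis_b):
    """Mean squared cosine of the principal angles; 1.0 means identical subspaces."""
    singular_values = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return float(np.mean(singular_values ** 2))

def prune_directions(embeddings, directions):
    """Project out span(directions); `directions` must have orthonormal columns."""
    return embeddings - (embeddings @ directions) @ directions.T

# Synthetic stand-in for VLM embeddings (assumption): isotropic "semantic"
# variation plus a strong low-rank component shared by both subgroups.
rng = np.random.default_rng(0)
dim, n_per_group, n_noise = 16, 500, 2
noise_basis, _ = np.linalg.qr(rng.normal(size=(dim, n_noise)))

def make_group():
    semantic = rng.normal(size=(n_per_group, dim))
    shared = 10.0 * rng.normal(size=(n_per_group, n_noise)) @ noise_basis.T
    return semantic + shared

group_a, group_b = make_group(), make_group()

# Hypothetical selection rule: treat the top-variance eigenvectors as candidates.
_, vecs_a = covariance_eigenbasis(group_a)
_, vecs_b = covariance_eigenbasis(group_b)
candidate_a, candidate_b = vecs_a[:, :n_noise], vecs_b[:, :n_noise]

# Subgroup invariance check: do two disjoint subsets recover the same subspace?
overlap = subspace_overlap(candidate_a, candidate_b)

# Prune the candidate directions from subgroup A.
pruned_a = prune_directions(group_a, candidate_a)
```

An overlap close to 1 means both subgroups recover nearly the same subspace, which is the kind of invariance the abstract reports. With real embeddings one would stack image and text features from the same model and then check downstream accuracy after pruning, since a stable subspace alone does not show that it is non-semantic.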