Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance

Date: 2026-05-02
Citations: 0
Influential: 0

AI Summary
Current strategies for selecting vision-language models (VLMs), such as relying on model scale or zero-shot accuracy, lack a systematic understanding of the compatibility between visual encoders and large language models, limiting their effectiveness. This work constructs a benchmark comprising 19 pre-trained visual encoders and introduces, for the first time, the Gromov-Wasserstein distance as a training-free metric to quantify cross-modal structural similarity for predicting VLM performance. The study further establishes a theoretical connection between this distance and the learnability of cross-modal alignment. Across more than 60 full-scale VLM training experiments, the proposed metric significantly outperforms existing approaches, demonstrating strong correlation with final model performance and enabling efficient and accurate pre-screening of model components.
๐Ÿ“ Abstract
Vision-Language Models (VLMs) have enhanced traditional LLMs with visual capabilities through the integration of vision encoders. While recent works have explored various combinations of vision encoders and LLMs, there still lacks a principled understanding of what makes a vision encoder suitable for VLM alignment. In this paper, we systematically investigate this question via comprehensive experiments on a curated collection of 19 pre-trained vision encoders from diverse sources. We first demonstrate that common practices, such as choosing encoders with the largest size or highest zero-shot accuracy, consistently fail to identify optimal models. In fact, these metrics show only weak to moderate correlation with VLM performance. This intriguing finding begs a fundamental question: What factors of vision-encoders matter in VLM? Through comprehensive analysis, we identify that the structural similarity across modalities plays a crucial but previously overlooked role in vision-encoder selection, which we measure using the Gromov-Wasserstein distance as a proxy. From a theoretical perspective, we show that the learnability of cross-modality mapping can be provably associated with the Gromov-Wasserstein distance. Empirical verification on 60+ full VLM training runs shows that our proposed inference-only metric performs significantly better than alternative model selection strategies and exhibits a much stronger correlation with final VLM performance, thereby enabling efficient and effective prediction of VLM performance before full training.
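For reference, the Gromov-Wasserstein distance mentioned in the abstract is commonly written in the discrete form below. This is the textbook definition, not necessarily the exact variant used in the paper: the abstract does not specify the exponent, the ground distances, or the sample weights, so the symbols here are assumptions. Suppose x_1, ..., x_n are vision-encoder embeddings with pairwise distance d_X and weights mu, and y_1, ..., y_m are language-side embeddings with pairwise distance d_Y and weights nu.

```latex
\[
\mathrm{GW}_p(\mu,\nu) \;=\; \min_{\pi \in \Pi(\mu,\nu)}
\left( \sum_{i,k=1}^{n} \sum_{j,l=1}^{m}
\bigl| d_X(x_i, x_k) - d_Y(y_j, y_l) \bigr|^{p}\, \pi_{ij}\, \pi_{kl} \right)^{1/p},
\]
\[
\Pi(\mu,\nu) \;=\; \Bigl\{ \pi \in \mathbb{R}_{\ge 0}^{n \times m} \;:\;
\textstyle\sum_{j} \pi_{ij} = \mu_i,\ \ \sum_{i} \pi_{ij} = \nu_j \Bigr\}.
\]
```

The objective compares distances within each space rather than distances across spaces, so the two embedding sets may have different dimensions. A small value means the pairwise-distance structure of one modality can be matched closely to that of the other. Some references include an extra constant factor (such as 1/2) in the definition.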
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Model Selection
Vision Encoders
Gromov-Wasserstein Distance
Cross-Modality Alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gromov-Wasserstein distance
vision-language models
model selection
cross-modality alignment
structural similarity