Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current strategies for choosing the vision encoder of a vision-language model (VLM), such as relying on model scale or zero-shot accuracy, lack a systematic understanding of the compatibility between visual encoders and large language models, which limits their effectiveness. This work constructs a benchmark comprising 19 pre-trained visual encoders and introduces, for the first time, the Gromov-Wasserstein distance as a training-free metric that quantifies cross-modal structural similarity in order to predict VLM performance. The study further establishes a theoretical connection between this distance and the learnability of cross-modal alignment. Across more than 60 full-scale VLM training experiments, the proposed metric significantly outperforms existing selection approaches and correlates strongly with final model performance, enabling efficient and accurate pre-screening of model components.

📝 Abstract

Vision-Language Models (VLMs) have enhanced traditional LLMs with visual capabilities through the integration of vision encoders. While recent works have explored various combinations of vision encoders and LLMs, a principled understanding of what makes a vision encoder suitable for VLM alignment is still lacking. In this paper, we systematically investigate this question via comprehensive experiments on a curated collection of 19 pre-trained vision encoders from diverse sources. We first demonstrate that common practices, such as choosing the encoder with the largest size or the highest zero-shot accuracy, consistently fail to identify optimal models. In fact, these metrics show only weak to moderate correlation with VLM performance. This intriguing finding raises a fundamental question: which properties of a vision encoder matter in a VLM? Through comprehensive analysis, we identify that structural similarity across modalities plays a crucial but previously overlooked role in vision-encoder selection, and we measure it using the Gromov-Wasserstein distance as a proxy. From a theoretical perspective, we show that the learnability of the cross-modality mapping can be provably associated with the Gromov-Wasserstein distance. Empirical verification on 60+ full VLM training runs shows that our proposed inference-only metric performs significantly better than alternative model selection strategies and exhibits a much stronger correlation with final VLM performance, thereby enabling efficient and effective prediction of VLM performance before full training.
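
To make the idea of "structural similarity across modalities" concrete, the following is a minimal sketch and not the paper's implementation. It evaluates the squared-loss Gromov-Wasserstein objective between two embedding spaces at a fixed coupling that pairs sample i with sample i. This value is only an upper bound on the Gromov-Wasserstein discrepancy, since a full solver would minimize over all couplings. The function names (`pairwise_dist`, `gw_cost`), the use of Euclidean distances, and the synthetic embeddings are all assumptions made for illustration.

```python
import numpy as np

def pairwise_dist(Z):
    """Euclidean distance matrix between the rows of Z (shape n x d)."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from round-off

def gw_cost(C1, C2, T):
    """Squared-loss GW objective at a given coupling T:
    sum_{i,j,k,l} (C1[i,k] - C2[j,l])^2 * T[i,j] * T[k,l],
    expanded so that no four-index tensor is ever built."""
    p = T.sum(axis=1)                     # marginal on the first space
    q = T.sum(axis=0)                     # marginal on the second space
    term1 = p @ (C1 ** 2) @ p
    term2 = q @ (C2 ** 2) @ q
    cross = np.sum((C1 @ T @ C2.T) * T)
    return term1 + term2 - 2.0 * cross

# Hypothetical stand-ins for embeddings of n paired image/text samples.
rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 8))                    # "vision" embeddings
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix
Y_same = X @ Q                                 # same geometry, rotated
Y_other = rng.normal(size=(n, 16))             # unrelated geometry

T_paired = np.eye(n) / n                       # sample i matched to sample i

cost_same = gw_cost(pairwise_dist(X), pairwise_dist(Y_same), T_paired)
cost_other = gw_cost(pairwise_dist(X), pairwise_dist(Y_other), T_paired)
print(f"rotated copy:    {cost_same:.3e}")
print(f"unrelated space: {cost_other:.3e}")
```

Because the objective compares only within-space distance matrices, the two spaces may have different dimensions, and a rotated copy of the same point cloud scores essentially zero while an unrelated cloud scores clearly higher. This is the sense in which such a quantity can be computed from inference-time embeddings alone, without training a connector between the modalities.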

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models

Model Selection

Vision Encoders

Gromov-Wasserstein Distance

Cross-Modality Alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Gromov-Wasserstein distance

Vision-Language Models

Model Selection