AI Summary
Current strategies for selecting vision encoders for vision-language models (VLMs), such as relying on model scale or zero-shot accuracy, lack a systematic understanding of the compatibility between visual encoders and large language models, which limits their effectiveness. This work constructs a benchmark comprising 19 pre-trained visual encoders and introduces, for the first time, the Gromov-Wasserstein distance as a training-free metric that quantifies cross-modal structural similarity in order to predict VLM performance. The study further establishes a theoretical connection between this distance and the learnability of cross-modal alignment. Across more than 60 full-scale VLM training experiments, the proposed metric significantly outperforms existing approaches and correlates strongly with final model performance, enabling efficient and accurate pre-screening of model components.
Abstract
Vision-Language Models (VLMs) have enhanced traditional LLMs with visual capabilities through the integration of vision encoders. While recent works have explored various combinations of vision encoders and LLMs, a principled understanding of what makes a vision encoder suitable for VLM alignment is still lacking. In this paper, we systematically investigate this question via comprehensive experiments on a curated collection of 19 pre-trained vision encoders from diverse sources. We first demonstrate that common practices, such as choosing encoders with the largest size or highest zero-shot accuracy, consistently fail to identify optimal models. In fact, these metrics show only weak to moderate correlation with VLM performance. This finding raises a fundamental question: what properties of vision encoders matter in VLMs? Through comprehensive analysis, we identify that structural similarity across modalities plays a crucial but previously overlooked role in vision-encoder selection, and we measure it using the Gromov-Wasserstein distance as a proxy. From a theoretical perspective, we show that the learnability of the cross-modality mapping can be provably associated with the Gromov-Wasserstein distance. Empirical verification on 60+ full VLM training runs shows that our proposed inference-only metric performs significantly better than alternative model selection strategies and exhibits a much stronger correlation with final VLM performance, thereby enabling efficient and effective prediction of VLM performance before full training.
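
For reference, the standard discrete Gromov-Wasserstein distance between two sets of features can be written as follows. This is the textbook definition and not necessarily the exact variant used in this work. The notation is assumed for illustration: $x_1,\dots,x_n$ stand for vision-encoder features and $y_1,\dots,y_m$ for language-model features, $d_X$ and $d_Y$ are within-modality distances, $\mu$ and $\nu$ are sample weights, and $p \ge 1$ is an exponent (commonly $p = 2$).

```latex
\[
\mathrm{GW}_p(\mu,\nu)
  = \left(
      \min_{\pi \in \Pi(\mu,\nu)}
      \sum_{i,k=1}^{n} \sum_{j,l=1}^{m}
      \bigl| d_X(x_i, x_k) - d_Y(y_j, y_l) \bigr|^{p}
      \, \pi_{ij} \, \pi_{kl}
    \right)^{1/p},
\]
\[
\Pi(\mu,\nu)
  = \left\{ \pi \in \mathbb{R}_{\ge 0}^{n \times m}
      \;:\; \pi \mathbf{1}_m = \mu,\ \ \pi^{\top} \mathbf{1}_n = \nu \right\}.
\]
```

Because the objective compares only pairwise distances within each modality, the two feature spaces do not need to share a dimension, and no trained projection between them is required. A smaller value indicates that the two spaces have more similar relational structure.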