🤖 AI Summary
This study investigates whether Vision Transformers (ViTs) are indispensable as visual encoders in vision-language models (VLMs), presenting the first systematic evaluation of state space models (SSMs) as alternative backbones. Under consistent ImageNet-1K initialization conditions and using a lightweight connector to interface with a large language model, the authors compare SSMs and ViTs on visual question answering and localization tasks, while also examining the impact of dense-task fine-tuning and training stability. Results demonstrate that SSMs achieve performance comparable or superior to ViTs at smaller model scales; dense fine-tuning consistently enhances performance across architectures; and the proposed stabilization strategies significantly improve robustness for both backbone types. The work further reveals a weak correlation between ImageNet accuracy and downstream VLM performance, offering new insights for visual encoder design.
📝 Abstract
Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We also observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.
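The frozen-backbone-plus-connector design described above can be sketched minimally: per-patch features from a frozen vision encoder are projected by a small trained module into the language model's embedding space. All names, dimensions, and the two-layer MLP form below are illustrative assumptions, not the paper's actual connector.

```python
import numpy as np

# Hypothetical dimensions: 768-d vision features per patch, a 4096-d LLM
# embedding space, and 196 patches (e.g. a 14x14 grid). These are assumptions.
VISION_DIM, LLM_DIM, NUM_PATCHES = 768, 4096, 196

rng = np.random.default_rng(0)

# Stand-in for the frozen backbone's output: one feature vector per patch.
patch_features = rng.standard_normal((NUM_PATCHES, VISION_DIM))

# Lightweight two-layer MLP connector -- in this setup, the only trained
# vision-side component; the backbone itself stays frozen.
W1 = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02
b1 = np.zeros(LLM_DIM)
W2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02
b2 = np.zeros(LLM_DIM)

def connector(x):
    # ReLU nonlinearity used here for simplicity; real connectors often use GELU.
    h = np.maximum(x @ W1 + b1, 0.0)
    return h @ W2 + b2

# Visual tokens ready to be interleaved with the LLM's text-token sequence.
visual_tokens = connector(patch_features)
print(visual_tokens.shape)  # (196, 4096)
```

The connector's appeal is that swapping the backbone (ViT vs. SSM) only changes `VISION_DIM` and the feature source; the language model and training recipe stay fixed, which is what makes the controlled comparison possible.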