Seeing to Generalize: How Visual Data Corrects Binding Shortcuts

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited generalization of vision-language models on purely textual tasks, particularly in out-of-distribution long-context retrieval scenarios. To enhance unimodal generalization, the authors propose a post-pretraining visual training phase that applies image-tokenized versions of synthetic tasks to text-only pretrained models. Their analysis reveals that visual training introduces spatial translation invariance, which disrupts the positional shortcuts learned during text-only training and encourages the adoption of a more robust symbolic binding mechanism. Empirical results demonstrate that this approach nearly doubles retrieval accuracy on out-of-distribution textual benchmarks, and that the learned mechanism remains effective even after training reverts to purely textual examples. Key contributions include the design of the synthetic tasks, a cross-modal training strategy, a comparative evaluation of visual encoders, and interpretability analyses elucidating the underlying improvements.

📝 Abstract
Vision Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely text-only tasks, particularly in long-context information retrieval. To investigate this effect, we build a controlled synthetic retrieval task and find that a transformer trained only on text achieves perfect in-distribution accuracy but fails to generalize out of distribution, while subsequent training on an image-tokenized version of the same task nearly doubles text-only OOD performance. Mechanistic interpretability reveals that visual training changes the model's internal binding strategy: text-only training encourages positional shortcuts, whereas image-based training disrupts them through spatial translation invariance, forcing the model to adopt a more robust symbolic binding mechanism that persists even after text-only examples are reintroduced. We further characterize how binding strategies vary across training regimes, visual encoders, and initializations, and show that analogous shifts occur during pretrained LLM-to-VLM transitions. Our findings suggest that cross-modal training can enhance reasoning and generalization even for tasks grounded in a single modality.
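The controlled retrieval task described above can be illustrated with a toy generator. This is a hedged sketch, not the paper's actual setup: the key:value format, vocabulary, and the `make_retrieval_example` helper are all illustrative assumptions. The point it shows is the failure mode the abstract names: if the queried pair always occupies the same position in training data, a model can learn a positional shortcut that breaks when the query targets a novel position.

```python
import random

def make_retrieval_example(n_pairs, query_index, vocab=None):
    """Build one synthetic key-value retrieval example (illustrative format):
    a context of key:value pairs followed by a query key whose value
    the model must retrieve."""
    vocab = vocab or [f"k{i}" for i in range(100)]
    keys = random.sample(vocab, n_pairs)          # distinct keys
    values = [str(random.randint(0, 9)) for _ in range(n_pairs)]
    context = " ".join(f"{k}:{v}" for k, v in zip(keys, values))
    return f"{context} | {keys[query_index]} ?", values[query_index]

random.seed(0)

# In-distribution: the queried pair always sits at a fixed position (index 2),
# so a model can succeed via a positional shortcut without binding key to value.
id_example, id_answer = make_retrieval_example(n_pairs=5, query_index=2)

# Out-of-distribution: the queried pair appears at a novel position (index 4);
# a positional shortcut fails here, while genuine symbolic binding does not.
ood_example, ood_answer = make_retrieval_example(n_pairs=5, query_index=4)
```

Rendering such strings as images and tokenizing them, as the paper's visual phase does, removes the fixed correspondence between token position and answer location, which is the mechanism the abstract credits for disrupting the shortcut.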
Problem

Research questions and friction points this paper is trying to address.

generalization
binding shortcuts
vision-language models
out-of-distribution
symbolic binding
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language models
binding mechanisms
distributional generalization
positional shortcuts
cross-modal training