🤖 AI Summary
This work addresses the underperformance of multimodal large language models (MLLMs) on fine-grained visual reasoning tasks. The authors attribute the gap less to weak visual representations than to under-utilization of visual information during instruction tuning, where many tasks can be partially solved from language priors alone. They reformulate classic self-supervised vision tasks, such as rotation prediction, color matching, and cross-view correspondence, as image-instruction-response triplets and mix them into visual instruction tuning as natural language instructions. Because these tasks cannot be solved without looking at the image, adding only a small fraction (3%–10%) of such visually grounded instructions steers models to base their responses on visual evidence. The method requires no human annotations, architectural modifications, or additional training stages, and it consistently improves performance on vision-centric benchmarks across multiple models and training regimes.
📝 Abstract
Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests that this limitation arises not from weak visual representations, but from under-utilization of visual information during instruction tuning, where many tasks can be partially solved using language priors alone. We propose a simple and lightweight approach that augments visual instruction tuning with a small number of visually grounded self-supervised tasks expressed as natural language instructions. By reformulating classical self-supervised pretext tasks, such as rotation prediction, color matching, and cross-view correspondence, as image-instruction-response triplets, we introduce supervision that cannot be solved without relying on visual evidence. Our approach requires no human annotations, no architectural modifications, and no additional training stages. Across multiple models, training regimes, and benchmarks, injecting only a small fraction (3-10%) of such visually grounded instructions consistently improves performance on vision-centric evaluations. Our findings highlight instruction tuning with visually grounded SSL tasks as a powerful lever for improving visual reasoning in MLLMs through simple adjustments to the training data distribution. Code available at: https://github.com/sirkosophia/V-GIFT
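To make the data recipe concrete, the following is a minimal sketch of how a rotation-prediction pretext task might be turned into image-instruction-response triplets and mixed into an instruction-tuning set. It is not the authors' implementation (see the linked repository for that). The prompt wording, the function names, the toy grid standing in for an image, and the reading of the 3-10% figure as a share of the final mixture are all assumptions made for illustration.

```python
import random

# Hypothetical instruction wording; the source does not give the actual prompts.
ROTATION_PROMPT = (
    "By how many degrees has this image been rotated clockwise? "
    "Answer with 0, 90, 180, or 270."
)


def rotate_clockwise(image, quarter_turns):
    """Rotate a 2D grid (list of rows) clockwise by quarter_turns * 90 degrees."""
    for _ in range(quarter_turns % 4):
        # Reversing the row order and transposing is one clockwise quarter turn.
        image = [list(row) for row in zip(*image[::-1])]
    return image


def make_rotation_triplet(image, rng):
    """Build one triplet whose answer is derived from the applied transform,
    so no human annotation is involved."""
    quarter_turns = rng.randrange(4)
    return {
        "image": rotate_clockwise(image, quarter_turns),
        "instruction": ROTATION_PROMPT,
        "response": str(90 * quarter_turns),
    }


def mix_in_ssl(base_samples, images, ssl_fraction, rng):
    """Append rotation triplets so they form roughly ssl_fraction of the mixture.

    Assumption: the fraction is measured against the final mixed dataset.
    """
    if not 0 <= ssl_fraction < 1:
        raise ValueError("ssl_fraction must be in [0, 1)")
    n_ssl = round(ssl_fraction * len(base_samples) / (1 - ssl_fraction))
    ssl_samples = [make_rotation_triplet(rng.choice(images), rng) for _ in range(n_ssl)]
    mixed = list(base_samples) + ssl_samples
    rng.shuffle(mixed)
    return mixed
```

In this sketch the label comes for free from the transformation itself, and the only knob is `ssl_fraction`, which mirrors the claim that the change is confined to the training data distribution. A real pipeline would operate on actual image files and would likely exclude images whose appearance is unchanged by rotation, since their labels would be ambiguous.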