Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training

📅 2026-05-12
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses two problems in the self-improvement training of multimodal large language models: data imbalance, which leaves hard examples under-trained, and language prior bias, which leads the model to underuse visual information. To mitigate these issues, the authors propose the VISTA framework. It uses a prefix resampling strategy to reuse partially correct reasoning traces, which alleviates the data imbalance, and a vision-aware attention score to explicitly quantify the model's focus on critical visual cues. The approach applies to both supervised fine-tuning and preference learning and improves performance across diverse multimodal models and benchmarks, with average gains of up to 13.66% on Qwen2.5-VL-3B-Instruct.
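The prefix resampling idea can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say how prefixes are chosen or how many continuations are sampled, so the longest-prefix-first order, the `samples_per_prefix` budget, and the callables `continue_fn` and `is_correct` are all hypothetical stand-ins for a model's sampler and an answer checker.

```python
from typing import Callable, List, Optional

Trace = List[str]  # a reasoning trace as a list of steps


def prefix_resample(
    steps: Trace,
    continue_fn: Callable[[Trace], Trace],
    is_correct: Callable[[Trace], bool],
    samples_per_prefix: int = 4,
) -> Optional[Trace]:
    """Try to recover a correct trace by reusing a prefix of a failed one.

    Assumption (not from the source): prefixes are tried from longest to
    shortest, so that as much of the existing reasoning as possible is kept.
    """
    for keep in range(len(steps) - 1, -1, -1):
        prefix = steps[:keep]
        for _ in range(samples_per_prefix):
            # continue_fn stands in for sampling a continuation from the model.
            candidate = prefix + continue_fn(prefix)
            if is_correct(candidate):
                return candidate
    return None  # no prefix led to a correct trace within the budget


# Toy usage: the first two steps are fine, the last one is wrong.
failed = ["read chart", "extract values", "wrong sum"]
fixed = prefix_resample(
    failed,
    continue_fn=lambda p: ["correct sum"] if p == failed[:2] else ["wrong sum"],
    is_correct=lambda t: t[-1] == "correct sum",
)
```

In this toy run the hard sample yields a usable trace without regenerating the steps that were already right, which is the efficiency argument the summary makes for reusing partial traces.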
📝 Abstract
Post-training with explicit reasoning traces is a common way to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, acquiring high-quality reasoning traces is often costly and time-consuming. Hence, the self-improvement paradigm has emerged, enabling MLLMs to self-generate reasoning traces for training without external supervision. Despite its effectiveness, we reveal two shortcomings in the self-improvement training of MLLMs: 1) data imbalance, where simple samples are over-trained while challenging yet crucial samples are under-trained; 2) language prior bias, where MLLMs rely too heavily on linguistic priors and neglect visual cues. To this end, we propose VISTA, a vision-aware self-improvement training framework for enhancing the multimodal reasoning of MLLMs. Specifically, VISTA first introduces a prefix resampling strategy that reuses partially correct reasoning traces for efficient data collection, and then designs a vision-aware attention score to quantify the model's focus on visual information. Extensive experiments show that VISTA can be applied to various post-training scenarios, i.e., supervised fine-tuning and preference learning, and effectively enhances multimodal reasoning performance across various MLLMs and tasks, e.g., bringing up to +13.66% average performance gains for Qwen2.5-VL-3B-Instruct.
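The abstract does not define the vision-aware attention score, so the following is only one plausible reading: the share of attention mass that generated tokens place on image-token positions. The function name `visual_attention_share`, the layout of `attn`, and the idea of using a simple ratio are all assumptions made for illustration.

```python
from typing import Sequence


def visual_attention_share(
    attn: Sequence[Sequence[float]],
    visual_positions: Sequence[int],
) -> float:
    """Fraction of total attention mass that falls on visual tokens.

    attn[i][j] is the attention weight from generated token i to input
    position j (assumed already averaged over heads and layers).
    visual_positions lists the input positions occupied by image tokens.
    """
    visual = set(visual_positions)
    total = 0.0
    on_visual = 0.0
    for row in attn:
        for j, weight in enumerate(row):
            total += weight
            if j in visual:
                on_visual += weight
    # Guard against an empty attention map.
    return on_visual / total if total > 0 else 0.0


# Toy usage: two generated tokens, three input positions, position 0 is visual.
score = visual_attention_share(
    [[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]],
    visual_positions=[0],
)
```

Under this reading, a low score on a trace would suggest the answer was reached mostly from the text, which is the language prior bias the abstract describes; how VISTA actually uses its score for data selection is not stated here.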
Problem

Research questions and friction points this paper is trying to address.

multimodal reasoning
self-improvement training
data imbalance
language prior bias
vision-aware
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-aware attention
self-improvement training
multimodal reasoning
prefix resampling
language prior bias