AI Summary
This work addresses two problems in the self-improvement training of multimodal large language models: data imbalance, which leads to insufficient learning from hard examples, and language prior bias, which leads to underutilization of visual information. To mitigate these issues, the authors propose the VISTA framework, which introduces a vision-aware attention mechanism to explicitly quantify and guide the model's focus toward critical visual cues. Additionally, a prefix resampling strategy is devised to reuse partially correct reasoning trajectories, thereby alleviating data bias. Applicable to both supervised fine-tuning and preference learning, the proposed approach consistently enhances performance across diverse multimodal models and benchmarks, achieving an average improvement of up to 13.66% on Qwen2.5-VL-3B-Instruct.
Abstract
Post-training with explicit reasoning traces is a common way to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, acquiring high-quality reasoning traces is often costly and time-consuming. Hence, the self-improvement paradigm has emerged, enabling MLLMs to self-generate reasoning traces for training without external supervision. Despite its effectiveness, we reveal two shortcomings in the self-improvement training of MLLMs: 1) data imbalance, where simple samples are over-trained while challenging yet crucial samples are under-trained; 2) language prior bias, where MLLMs rely excessively on linguistic priors while neglecting visual cues. To this end, we propose VISTA, a vision-aware self-improvement training framework for enhancing the multimodal reasoning of MLLMs. Specifically, VISTA first introduces a prefix resampling strategy that reuses partially correct reasoning traces for efficient data collection, and then designs a vision-aware attention score to quantify the model's focus on visual information. Extensive experiments show that VISTA can be applied to various post-training scenarios, i.e., supervised fine-tuning and preference learning, and effectively enhances multimodal reasoning performance across various MLLMs and tasks, e.g., bringing up to +13.66% average performance gains for Qwen2.5-VL-3B-Instruct.
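The abstract does not give the exact definitions of the vision-aware attention score or the prefix resampling procedure, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the score is the share of each generated token's attention mass that falls on image-token positions, averaged over the trace, and that a "prefix" is the first `keep` steps of a failed trace. All names (`vision_aware_score`, `resample_from_prefix`, `continue_fn`, `is_correct`) are hypothetical.

```python
from typing import Callable, List, Sequence


def visual_attention_share(attn_row: Sequence[float],
                           visual_positions: Sequence[int]) -> float:
    """Fraction of one generated token's attention mass on visual tokens."""
    total = sum(attn_row)
    if total <= 0:
        return 0.0
    return sum(attn_row[i] for i in visual_positions) / total


def vision_aware_score(attn_rows: Sequence[Sequence[float]],
                       visual_positions: Sequence[int]) -> float:
    """Assumed score: mean visual attention share over all generated tokens."""
    if not attn_rows:
        return 0.0
    shares = [visual_attention_share(row, visual_positions) for row in attn_rows]
    return sum(shares) / len(shares)


def resample_from_prefix(steps: Sequence[str],
                         keep: int,
                         continue_fn: Callable[[List[str]], List[str]],
                         is_correct: Callable[[List[str]], bool],
                         n_samples: int = 4) -> List[List[str]]:
    """Keep the first `keep` steps of a trace, resample the remainder,
    and return only the completed traces that pass the correctness check."""
    prefix = list(steps[:keep])
    accepted = []
    for _ in range(n_samples):
        # continue_fn stands in for sampling a continuation from the model.
        candidate = prefix + list(continue_fn(prefix))
        if is_correct(candidate):
            accepted.append(candidate)
    return accepted
```

In such a setup, a hard sample whose full trace is wrong could still contribute training data by reusing its correct early steps, and the score could be used to rank or filter the accepted traces by how much they attend to the image. Both uses are assumptions about how the two components might fit together.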