🤖 AI Summary
Existing unsupervised self-evolution methods for multimodal large language models rely on majority voting to generate pseudo-labels, which are prone to model bias and often lead to unstable training and performance degradation. This work proposes the Continuous Softened Retracing reSampling (CSRS) framework, which introduces a Retracing Re-inference Mechanism to broaden exploration of long-tail reasoning paths. By integrating a Softened Frequency Reward with Visual Semantic Perturbation, CSRS enables continuous calibration and stable optimization of reasoning-path quality, strengthening the model's reliance on mathematical logic over superficial visual cues. Evaluated on benchmarks such as MathVision, CSRS substantially improves the reasoning performance of Qwen2.5-VL-7B and achieves state-of-the-art results on geometric tasks under unsupervised self-evolution settings.
📝 Abstract
In the unsupervised self-evolution of Multimodal Large Language Models (MLLMs), the quality of feedback signals during post-training is pivotal for stable and effective learning. However, existing self-evolution methods predominantly rely on majority voting to select the most frequent output as the pseudo-golden answer; this choice may reflect the model's intrinsic biases rather than guarantee the objective correctness of the reasoning paths. To counteract this degradation, we propose \textbf{C}ontinuous \textbf{S}oftened \textbf{R}etracing re\textbf{S}ampling (\textbf{CSRS}) for MLLM self-evolution. Specifically, we introduce a Retracing Re-inference Mechanism (\textbf{RRM}), in which the model re-infers from anchor points to expand the exploration of long-tail reasoning paths. Simultaneously, we propose a Softened Frequency Reward (\textbf{SFR}), which replaces binary rewards with continuous signals, calibrating rewards based on answer frequencies across sampled reasoning sets. Furthermore, combined with Visual Semantic Perturbation (\textbf{VSP}), CSRS ensures the model prioritizes mathematical logic over superficial visual cues. Experimental results demonstrate that CSRS significantly enhances the reasoning performance of Qwen2.5-VL-7B on benchmarks such as MathVision, achieving state-of-the-art (SOTA) results in unsupervised self-evolution on geometric tasks. Our code is available at https://github.com/yyy195/CSRS.
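The abstract does not give the exact SFR formula, but the core idea (replacing a 0/1 majority-vote pseudo-label with a continuous, frequency-calibrated reward over a set of sampled reasoning paths) can be sketched as follows. The function name and the choice of relative frequency as the softening scheme are illustrative assumptions, not the paper's actual implementation:

```python
from collections import Counter

def softened_frequency_rewards(final_answers):
    """Illustrative sketch of a frequency-softened reward (not the
    paper's exact SFR): each sampled reasoning path receives a
    continuous reward equal to the relative frequency of its final
    answer across the sampled set, rather than a binary 1/0 label
    for matching/missing the majority-vote answer."""
    counts = Counter(final_answers)
    n = len(final_answers)
    return [counts[a] / n for a in final_answers]

# Five sampled paths; three agree on "42", two dissent.
# Majority voting would assign [1, 1, 1, 0, 0]; the softened reward
# instead grades each path by how much consensus supports its answer.
rewards = softened_frequency_rewards(["42", "42", "42", "17", "9"])
```

Because dissenting paths still receive a nonzero signal proportional to their support, long-tail answers are not entirely suppressed, which is consistent with the paper's goal of expanding exploration beyond the majority-vote consensus.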