Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling

📅 2026-04-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing unsupervised self-evolution methods for multimodal large language models rely on majority voting to generate pseudo-labels, which are prone to model bias and often lead to unstable training and performance degradation. This work proposes the Continuous Softened Retracing reSampling (CSRS) framework, which introduces a retracing re-inference mechanism to broaden exploration of long-tailed reasoning paths. By integrating softened frequency-based rewards with visual-semantic perturbations, CSRS enables continuous reward calibration and stable optimization of reasoning-path quality, strengthening the model's reliance on mathematical logic over superficial visual cues. Evaluated on benchmarks such as MathVision, CSRS substantially improves the reasoning performance of Qwen2.5-VL-7B and achieves state-of-the-art results on geometric tasks under unsupervised self-evolution settings.
📝 Abstract
In the unsupervised self-evolution of Multimodal Large Language Models, the quality of feedback signals during post-training is pivotal for stable and effective learning. However, existing self-evolution methods predominantly rely on majority voting to select the most frequent output as the pseudo-golden answer, which may reflect the model's intrinsic biases rather than guarantee the objective correctness of the reasoning paths. To counteract this degradation, we propose \textbf{C}ontinuous \textbf{S}oftened \textbf{R}etracing re\textbf{S}ampling (\textbf{CSRS}) for MLLM self-evolution. Specifically, we introduce a Retracing Re-inference Mechanism (\textbf{RRM}), in which the model re-infers from anchor points to expand the exploration of long-tail reasoning paths. Simultaneously, we propose the Softened Frequency Reward (\textbf{SFR}), which replaces binary rewards with continuous signals, calibrating the reward based on each answer's frequency across sampled reasoning sets. Furthermore, combined with Visual Semantic Perturbation (\textbf{VSP}), CSRS ensures the model prioritizes mathematical logic over visual superficiality. Experimental results demonstrate that CSRS significantly enhances the reasoning performance of Qwen2.5-VL-7B on benchmarks such as MathVision. We achieve state-of-the-art (SOTA) results in unsupervised self-evolution on geometric tasks. Our code is available at https://github.com/yyy195/CSRS.
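The abstract contrasts binary majority-vote rewards with the continuous Softened Frequency Reward. The paper's exact formulation is not shown here, so the following is only a minimal sketch of the idea as described: each sampled answer receives a reward equal to its relative frequency across the sampled reasoning set, rather than a hard 1/0 for matching the majority vote. The function names and the toy answer set are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def softened_frequency_rewards(answers):
    """Continuous reward (sketch): each sampled answer is rewarded by its
    relative frequency in the sampled reasoning set, so minority
    (long-tail) answers still receive a graded, nonzero signal."""
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]

def majority_vote_rewards(answers):
    """Baseline (sketch): binary reward, 1 only for answers that match
    the majority-vote pseudo-golden answer."""
    mode, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == mode else 0.0 for a in answers]

# Toy example: five sampled answers to the same question.
samples = ["42", "42", "42", "17", "9"]
print(softened_frequency_rewards(samples))  # [0.6, 0.6, 0.6, 0.2, 0.2]
print(majority_vote_rewards(samples))       # [1.0, 1.0, 1.0, 0.0, 0.0]
```

The sketch illustrates the claimed benefit: when the majority answer stems from model bias rather than correctness, the softened reward still distributes graded credit over minority reasoning paths instead of collapsing all signal onto the mode.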
Problem

Research questions and friction points this paper is trying to address.

unsupervised self-evolution
Multimodal Large Language Models
feedback signals
reasoning paths
pseudo-golden answer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous Softened Retracing reSampling
Retracing Re-inference Mechanism
Softened Frequency Reward
Visual Semantic Perturbation
Unsupervised Self-Evolution
Yunyao Yu
Tsinghua University
Zhengxian Wu
Tsinghua University
Computer Vision, Large Language Model
Zhuohong Chen
Tsinghua University
Hangrui Xu
Tsinghua University
Zirui Liao
Tsinghua University
Xiangwen Deng
Tsinghua University
Zhifang Liu
School of Mathematical Sciences, Tianjin Normal University
image processing
Senyuan Shi
Tsinghua University
Haoqian Wang
Tsinghua University