Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling

📅 2026-04-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing unsupervised self-evolution methods for multimodal large language models rely on majority voting to generate pseudo-labels, which are prone to model bias and often lead to unstable training and performance degradation. This work proposes the Continuous Softened Retracing reSampling (CSRS) framework, which introduces a retracing re-inference mechanism to broaden exploration of long-tailed reasoning paths. By integrating softened frequency-based rewards with visual-semantic perturbations, CSRS enables continuous reward calibration and stable optimization of reasoning-path quality, strengthening the model's reliance on mathematical logic over superficial visual cues. Evaluated on benchmarks such as MathVision, CSRS substantially improves the reasoning performance of Qwen2.5-VL-7B and achieves state-of-the-art results on geometric tasks under unsupervised self-evolution settings.
📝 Abstract
In the unsupervised self-evolution of Multimodal Large Language Models, the quality of feedback signals during post-training is pivotal for stable and effective learning. However, existing self-evolution methods predominantly rely on majority voting to select the most frequent output as the pseudo-golden answer, which may reflect the model's intrinsic biases rather than guarantee the objective correctness of the reasoning paths. To counteract this degradation, we propose \textbf{C}ontinuous \textbf{S}oftened \textbf{R}etracing re\textbf{S}ampling (\textbf{CSRS}) for MLLM self-evolution. Specifically, we introduce a Retracing Re-inference Mechanism (\textbf{RRM}), in which the model re-infers from anchor points to expand the exploration of long-tail reasoning paths. Simultaneously, we propose the Softened Frequency Reward (\textbf{SFR}), which replaces binary rewards with continuous signals, calibrating the reward based on each answer's frequency across sampled reasoning sets. Furthermore, combined with Visual Semantic Perturbation (\textbf{VSP}), CSRS ensures the model prioritizes mathematical logic over visual superficiality. Experimental results demonstrate that CSRS significantly enhances the reasoning performance of Qwen2.5-VL-7B on benchmarks such as MathVision. We achieve state-of-the-art (SOTA) results in unsupervised self-evolution on geometric tasks. Our code is available at https://github.com/yyy195/CSRS.
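The abstract contrasts binary majority-vote rewards with the continuous Softened Frequency Reward. The paper's exact formulation is not shown here, so the following is only a minimal sketch of the idea as described: each sampled answer receives a reward equal to its relative frequency across the sampled reasoning set, rather than a hard 1/0 for matching the majority vote. The function names and the toy answer set are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def softened_frequency_rewards(answers):
    """Continuous reward (sketch): each sampled answer is rewarded by its
    relative frequency in the sampled reasoning set, so minority
    (long-tail) answers still receive a graded, nonzero signal."""
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]

def majority_vote_rewards(answers):
    """Baseline (sketch): binary reward, 1 only for answers that match
    the majority-vote pseudo-golden answer."""
    mode, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == mode else 0.0 for a in answers]

# Toy example: five sampled answers to the same question.
samples = ["42", "42", "42", "17", "9"]
print(softened_frequency_rewards(samples))  # [0.6, 0.6, 0.6, 0.2, 0.2]
print(majority_vote_rewards(samples))       # [1.0, 1.0, 1.0, 0.0, 0.0]
```

The sketch illustrates the claimed benefit: when the majority answer stems from model bias rather than correctness, the softened reward still distributes graded credit over minority reasoning paths instead of collapsing all signal onto the mode.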
Problem

Research questions and friction points this paper is trying to address.

unsupervised self-evolution
Multimodal Large Language Models
feedback signals
reasoning paths
pseudo-golden answer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous Softened Retracing reSampling
Retracing Re-inference Mechanism
Softened Frequency Reward
Visual Semantic Perturbation
Unsupervised Self-Evolution
Yunyao Yu
Tsinghua University
Zhengxian Wu
Tsinghua University
Computer Vision, Large Language Model
Zhuohong Chen
Tsinghua University
Hangrui Xu
Tsinghua University
Zirui Liao
Tsinghua University
Xiangwen Deng
Tsinghua University
Zhifang Liu
School of Mathematical Sciences, Tianjin Normal University
image processing
Senyuan Shi
Tsinghua University
Haoqian Wang
Tsinghua University