RISE: Reliable Improvement in Self-Evolving Vision-Language Models

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing self-evolution approaches for vision-language models suffer from coarse-grained role alternation, degradation in generated question quality, and collapse in question-type distribution, leading to inefficient and unreliable performance gains. This work proposes RISE, a novel framework that, for the first time, enables fine-grained question-answering role interaction, incorporates quality-aware supervision for both questions and pseudo-labels, and introduces a skill-aware dynamic sampling strategy within a dual-role closed-loop self-training architecture. By effectively mitigating mode collapse and ensuring balanced development across diverse skills, RISE achieves consistent and broad performance improvements across two mainstream vision-language model backbones and seven benchmark datasets, demonstrating its effectiveness and generalizability.

📝 Abstract

Vision-language models (VLMs) have achieved strong multimodal reasoning capabilities, but further improving them still relies heavily on large-scale human-constructed supervision for post-training. Such supervision is costly to obtain, especially for reasoning-intensive multimodal tasks where questions, answers, and feedback signals must be carefully designed. This motivates self-evolving learning, where a model improves itself through a dual-role closed loop: a questioner autonomously poses questions and a solver learns to solve them. However, we observe that current VLM self-evolving methods still face three major challenges: coarse-grained role alternation delays the interaction between question generation and solver adaptation; generated questions can progressively degrade in quality; and question types may collapse toward a narrow distribution. These issues limit the efficiency and reliability of self-evolution. Thus, we propose RISE, a reliable self-evolving framework for vision-language models. RISE is built on three complementary designs: fine-grained role alternation, which shortens the feedback loop between the questioner and the solver to improve efficiency; a quality supervisor, which improves question validity and pseudo-label reliability; and skill-aware dynamic balancing, which mitigates mode collapse and maintains broad skill coverage during evolution. Together, these components enable more reliable and effective self-evolution from unlabeled images. Experiments on two VLM backbones across seven benchmarks show that RISE consistently improves the base models, yielding broad and sustained gains. Our code is publicly available at https://github.com/AMAP-ML/RISE.
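
To make the idea of skill-aware dynamic balancing concrete, the sketch below shows one simple way a sampler could counteract collapse toward a narrow question-type distribution: skills that have been generated less often receive a higher sampling probability. The abstract does not describe how RISE implements its balancing, so the inverse-frequency rule, the function names, and the skill labels here are all illustrative assumptions, not the authors' method.

```python
import random
from collections import Counter


def skill_sampling_weights(counts, skills, temperature=1.0):
    """Return normalized sampling weights that favor under-represented skills.

    Assumption: inverse-frequency weighting with add-one smoothing.
    This is an illustrative rule, not the scheme used by RISE.
    """
    # Add-one smoothing gives unseen skills the largest finite weight.
    raw = {s: (1.0 / (counts.get(s, 0) + 1)) ** temperature for s in skills}
    total = sum(raw.values())
    return {s: w / total for s, w in raw.items()}


def sample_skill(counts, skills, rng):
    """Draw the skill to target for the next generated question."""
    weights = skill_sampling_weights(counts, skills)
    return rng.choices(skills, weights=[weights[s] for s in skills], k=1)[0]


# Hypothetical skill labels and running counts of generated questions.
skills = ["counting", "ocr", "spatial", "chart"]
counts = Counter({"counting": 9, "ocr": 1})

weights = skill_sampling_weights(counts, skills)
rng = random.Random(0)
next_skill = sample_skill(counts, skills, rng)
counts[next_skill] += 1  # update the running count after each draw
```

With these counts, the two unseen skills receive the highest weight and the over-represented "counting" skill the lowest, so repeated draws push the distribution back toward broad coverage. Raising the hypothetical `temperature` parameter above 1 makes the correction more aggressive; values below 1 soften it.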

Problem

Research questions and friction points this paper is trying to address.

self-evolving

vision-language models

question generation

mode collapse

pseudo-label reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-evolving

vision-language models

fine-grained role alternation