V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation

📅 2026-01-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work proposes the first fully unsupervised, self-improving multimodal reasoning framework, addressing the reliance of vision-language models on large-scale human-annotated data. By leveraging a co-evolutionary mechanism between a Questioner and a Solver, the framework achieves continuous optimization using only unlabeled images. It introduces a dual-track reasoning reward scheme and a population-voting-based pseudo-labeling strategy, and employs Group Relative Policy Optimization (GRPO) for iterative training. Evaluated on Qwen2.5-VL-7B-Instruct, the method yields consistent performance gains, improving visual mathematical reasoning and general vision-centric tasks by 1.7 and 2.6 points, respectively, demonstrating the effectiveness of unsupervised self-improvement in multimodal reasoning.
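The population-voting pseudo-labeling step can be sketched as below: the Solver samples several answers to the same question, and the most frequent one becomes the training target. This is a minimal illustration, not the paper's implementation; the function name and the confidence proxy are our own.

```python
from collections import Counter

def majority_vote_pseudo_label(answers):
    """Pick the most frequent answer among the Solver's sampled
    responses as the pseudo-label; return its vote share as a
    rough confidence proxy (illustrative, not from the paper)."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)

# Five sampled answers to one question; three agree on "42".
label, conf = majority_vote_pseudo_label(["42", "42", "7", "42", "13"])
# label == "42", conf == 0.6
```

In practice answers would be normalized (e.g. stripped of formatting) before counting, so that superficially different strings vote together.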

📝 Abstract
Recent advances in multimodal learning have significantly enhanced the reasoning capabilities of vision-language models (VLMs). However, state-of-the-art approaches rely heavily on large-scale human-annotated datasets, which are costly and time-consuming to acquire. To overcome this limitation, we introduce V-Zero, a general post-training framework that facilitates self-improvement using exclusively unlabeled images. V-Zero establishes a co-evolutionary loop by instantiating two distinct roles: a Questioner and a Solver. The Questioner learns to synthesize high-quality, challenging questions by leveraging a dual-track reasoning reward that contrasts intuitive guesses with reasoned results. The Solver is optimized using pseudo-labels derived from majority voting over its own sampled responses. Both roles are trained iteratively via Group Relative Policy Optimization (GRPO), driving a cycle of mutual enhancement. Remarkably, without a single human annotation, V-Zero achieves consistent performance gains on Qwen2.5-VL-7B-Instruct, improving visual mathematical reasoning by +1.7 and general vision-centric tasks by +2.6, demonstrating the potential of self-improvement in multimodal systems. Code is available at https://github.com/SatonoDia/V-Zero
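The core of GRPO, as used here for both roles, is that each sampled response is scored against the other responses in its own group rather than against a learned value baseline. A minimal sketch of that group-relative advantage, under our own naming and with a small epsilon for numerical safety:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each rollout's reward
    against the mean and std of its own group of sampled responses,
    removing the need for a separate value network (sketch only)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# A group of four rollouts: two correct (reward 1), two wrong (reward 0).
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Advantages sum to ~0; correct rollouts get positive advantage.
```

These advantages then weight the policy-gradient update for each response, so the model is pushed toward answers that beat its own group average.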
Problem

Research questions and friction points this paper is trying to address.

multimodal reasoning
human annotation
vision-language models
data annotation cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-improvement
zero annotation
multimodal reasoning
co-evolutionary framework
pseudo-labeling