MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of enabling vision-language models (VLMs) to undergo self-evolutionary training without any initial data. The authors propose MM-Zero, a novel framework that introduces a tripartite collaborative mechanism comprising a Proposer, a Coder, and a Solver. The Proposer generates abstract visual concepts and formulates questions, which the Coder translates into executable code (e.g., Python or SVG) to render corresponding images; the Solver then performs multimodal reasoning on these synthesized examples. Departing from conventional dual-agent paradigms, MM-Zero integrates Group Relative Policy Optimization (GRPO), visual verification, and a difficulty-balanced reward mechanism to achieve end-to-end self-evolution from scratch. Experiments demonstrate that MM-Zero substantially enhances VLM reasoning performance across multiple multimodal benchmarks, offering a scalable pathway toward autonomous model evolution.
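The tripartite loop described above can be illustrated with a minimal sketch. All names here are illustrative, not the authors' actual API: in MM-Zero each role is a VLM policy, whereas this sketch uses stubs to show the data flow (concept → code → rendered image → answer) and the three reward signals (execution feedback, answer verification, difficulty balancing).

```python
# Hypothetical sketch of MM-Zero's Proposer/Coder/Solver loop.
# Role functions are stubs standing in for VLM policies.
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

def proposer():
    # Proposer: emits an abstract visual concept and a question about it.
    return {"concept": "three red circles in a row",
            "question": "How many circles are in the image?",
            "answer": "3"}

def coder(concept):
    # Coder: translates the concept into executable rendering code (SVG here).
    circles = "".join(
        f'<circle cx="{30 + 60 * i}" cy="50" r="20" fill="red"/>'
        for i in range(3))
    return f'<svg xmlns="{SVG_NS}" width="200" height="100">{circles}</svg>'

def render(svg_code):
    # Execution feedback: reward 1.0 only if the generated code actually
    # parses/renders; broken code yields no image and zero reward.
    try:
        ET.fromstring(svg_code)
        return svg_code, 1.0
    except ET.ParseError:
        return None, 0.0

def solver(image, question):
    # Solver: multimodal reasoning over the rendered image, stubbed here
    # by counting shapes in the parsed SVG tree.
    n = len(ET.fromstring(image).findall(f".//{{{SVG_NS}}}circle"))
    return str(n)

def difficulty_reward(solve_rate, target=0.5):
    # Difficulty balancing: the Proposer is rewarded most when the Solver
    # succeeds about half the time, keeping tasks neither trivial nor
    # impossible (target value is an assumption for illustration).
    return 1.0 - abs(solve_rate - target)

task = proposer()
image, exec_reward = render(coder(task["concept"]))
pred = solver(image, task["question"])
solve_rate = float(pred == task["answer"])  # averaged over rollouts in practice
print(exec_reward, pred, round(difficulty_reward(solve_rate), 2))
```

In the actual framework the Solver's solve rate would be estimated over a group of sampled rollouts, which is what makes the difficulty signal compatible with GRPO's group-based training.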

📝 Abstract
Self-evolving has emerged as a key paradigm for improving foundational models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. While recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no data, VLMs introduce an additional visual modality that typically requires at least some seed data, such as images, to bootstrap the self-evolution process. In this work, we present Multi-model Multimodal Zero (MM-Zero), the first RL-based framework to achieve zero-data self-evolution for VLM reasoning. Moving beyond prior dual-role (Proposer and Solver) setups, MM-Zero introduces a multi-role self-evolving training framework comprising three specialized roles: a Proposer that generates abstract visual concepts and formulates questions; a Coder that translates these concepts into executable code (e.g., Python, SVG) to render visual images; and a Solver that performs multimodal reasoning over the generated visual content. All three roles are initialized from the same base model and trained using Group Relative Policy Optimization (GRPO), with carefully designed reward mechanisms that integrate execution feedback, visual verification, and difficulty balancing. Our experiments show that MM-Zero improves VLM reasoning performance across a wide range of multimodal benchmarks. MM-Zero establishes a scalable path toward self-evolving multi-model systems for multimodal models, extending the frontier of self-improvement beyond the conventional two-model paradigm.
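Since all three roles are trained with GRPO, the core update signal is a group-relative advantage: each sampled rollout's reward is normalized against the other rollouts for the same prompt, removing the need for a learned value critic. A minimal sketch of that normalization (a standard GRPO formulation, not code from the paper):

```python
# Sketch of GRPO's group-relative advantage: for a group of G rollouts
# sampled from the same prompt, each reward is standardized against the
# group's mean and standard deviation.
import statistics

def group_relative_advantages(rewards):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All rollouts scored the same: no relative signal in this group.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Example: two successful and two failed rollouts in a group of four.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [1.0, -1.0, -1.0, 1.0]
```

In MM-Zero the per-rollout reward would itself be the composite of execution feedback, visual verification, and the difficulty-balancing term before this normalization is applied.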
Problem

Research questions and friction points this paper is trying to address.

zero-data
self-evolving
vision language models
multimodal reasoning
visual modality
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-data self-evolution
multi-role framework
vision language models
reinforcement learning
executable code generation