🤖 AI Summary
Existing large multimodal models (LMMs) rely heavily on human-annotated data or external reward models, limiting their autonomy and scalability. Method: We propose EvoLMM—the first fully unsupervised self-evolution framework for LMMs—built upon a Proposer-Solver dual-agent architecture: the Proposer autonomously generates questions from raw images, while the Solver reasons over multimodal inputs to produce answers; both agents jointly optimize via intrinsic consistency feedback, enabling self-supervised reward learning without external supervision. Contribution/Results: Implemented end-to-end on Qwen2.5-VL, EvoLMM eliminates reliance on manual annotations or external reward models. It achieves an average +3% improvement on ChartQA, MathVista, and MathVision benchmarks using only unlabeled image data, demonstrating stable performance gains and significantly advancing autonomous multimodal reasoning in LMMs.
📝 Abstract
Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which answers them, with learning driven by internal consistency through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth labels or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains of up to $\sim$3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline, easing future research on self-improving LMMs in a fully unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.
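The intrinsic consistency feedback described above can be sketched in a few lines. The idea: sample several Solver answers per Proposer question, reward the Solver by agreement with the majority answer, and reward the Proposer for questions that are neither trivial (unanimous agreement) nor unanswerable (no agreement). Note this is a minimal illustrative sketch, not the paper's implementation; the agreement thresholds `low` and `high` are hypothetical choices, and the actual reward shaping in EvoLMM may differ.

```python
from collections import Counter

def consistency_reward(answers):
    """Solver-side reward: fraction of sampled answers that
    agree with the majority answer (self-consistency)."""
    counts = Counter(answers)
    _, majority_freq = counts.most_common(1)[0]
    return majority_freq / len(answers)

def proposer_reward(answers, low=0.3, high=0.9):
    """Proposer-side reward (illustrative): favor questions whose
    answer agreement falls in a mid band -- too-easy questions
    (near-unanimous) and noise questions (no agreement) get 0.
    The [low, high] band is an assumed hyperparameter, not from the paper."""
    r = consistency_reward(answers)
    return 1.0 if low <= r <= high else 0.0

# Example: three sampled Solver answers to one generated question.
samples = ["7", "7", "3"]
print(consistency_reward(samples))   # majority "7" appears 2 of 3 times
print(proposer_reward(samples))      # agreement in band -> rewarded
```

In a training loop, these scalar rewards would feed a policy-gradient style update for both roles of the shared backbone; no ground-truth answers or external reward model are consulted.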