Diving into Self-Evolving Training for Multimodal Reasoning

📅 2024-12-23
🏛️ arXiv.org
📈 Citations: 15 · Influential: 2
🤖 AI Summary
Self-evolving training for multimodal reasoning often suffers from performance saturation and heavy reliance on human annotations. Method: This paper presents a systematic analysis of the task, decomposing the training process into three core factors (training method, reward model, and prompt variation) and studying the self-evolution dynamics that emerge during training. Based on this analysis, it proposes M-STaR, a general and efficient framework that combines best-practice choices for each factor with an automatic balancing mechanism that counters saturation. Results: M-STaR is validated on three foundation models (MiniCPM-V-2.5, Phi-3.5-Vision, and InternVL2) and achieves significant improvements over the pre-evolved models on five multimodal reasoning benchmarks, without any additional human annotation. The policy model, reward model, and collected data are publicly released to facilitate further research on multimodal reasoning.
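
To make the loop concrete, below is a minimal Python sketch of one self-evolving iteration in the generate, score, filter, retrain pattern the summary describes. Every name here (generate, reward_model, finetune) and the random scorer are hypothetical stand-ins for illustration, not the paper's implementation.

```python
# A minimal sketch of one self-evolving training iteration, assuming a
# rejection-sampling-style loop: generate candidates, score them with a
# reward model, keep the best, and fine-tune on what was kept.
# Every name below is a hypothetical stand-in, not the paper's API.
import random

def generate(policy, prompt, k=8, temperature=1.0):
    # Sample k chain-of-thought responses from the current policy (stubbed).
    return [f"{prompt} -> reasoning path {i} (T={temperature})" for i in range(k)]

def reward_model(prompt, response):
    # Score a response; the paper trains a learned reward model, stubbed here.
    return random.random()

def finetune(policy, data):
    # Update the policy on the filtered self-generated data (stubbed).
    print(f"fine-tuning on {len(data)} self-generated examples")
    return policy

def self_evolve_step(policy, prompts, k=8, keep_top=2):
    kept = []
    for prompt in prompts:
        responses = generate(policy, prompt, k=k)
        ranked = sorted(responses, key=lambda r: reward_model(prompt, r), reverse=True)
        kept += [(prompt, r) for r in ranked[:keep_top]]  # keep the highest-reward paths
    return finetune(policy, kept)

policy = object()  # placeholder for an actual large multimodal model
policy = self_evolve_step(policy, ["Q1: <image> how many ...?", "Q2: <image> which region ...?"])
```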

📝 Abstract
Reasoning ability is essential for Large Multimodal Models (LMMs). In the absence of multimodal chain-of-thought annotated data, self-evolving training, where the model learns from its own outputs, has emerged as an effective and scalable approach for enhancing reasoning abilities. Despite its growing usage, a comprehensive understanding of self-evolving training, particularly in the context of multimodal reasoning, remains limited. In this paper, we delve into the intricacies of self-evolving training for multimodal reasoning, pinpointing three key factors: Training Method, Reward Model, and Prompt Variation. We systematically examine each factor and explore how various configurations affect the training's effectiveness. Our analysis leads to a set of best practices for each factor, aimed at optimizing multimodal reasoning. Furthermore, we explore the Self-Evolution Dynamics during training and the impact of automatic balancing mechanisms in boosting performance. After all the investigations, we present a final recipe for self-evolving training in multimodal reasoning, encapsulating these design choices into a framework we call M-STaR (Multimodal Self-evolving Training for Reasoning), which is universally effective for models of different sizes on various benchmarks, e.g., surpassing the pre-evolved model significantly on 5 multimodal reasoning benchmarks without using additional human annotations, as demonstrated on MiniCPM-V-2.5 (8B), Phi-3.5-Vision (4B) and InternVL2 (2B). We believe this study fills a significant gap in the understanding of self-evolving training for multimodal reasoning and offers a robust framework for future research. Our policy and reward models, as well as the collected data, are released to facilitate further investigation in multimodal reasoning.
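
The three factors named in the abstract (Training Method, Reward Model, Prompt Variation) define a design space that the paper sweeps systematically. As a rough illustration only, the configuration object below shows what one point in that space might look like; the option strings are invented for this sketch and are not the paper's exact taxonomy.

```python
# Illustrative only: one point in the (training method, reward model,
# prompt variation) design space the paper examines. Option names are
# invented for this sketch, not the paper's exact taxonomy.
from dataclasses import dataclass

@dataclass
class SelfEvolveConfig:
    training_method: str = "continuous"          # vs. e.g. "iterative", restarting from the base model
    reward_model: str = "learned_reranker"       # vs. e.g. "answer_match" (final-answer checking only)
    prompt_variation: str = "labeled+unlabeled"  # vs. e.g. "labeled_only"
    samples_per_prompt: int = 8                  # candidates drawn per query
    keep_top: int = 2                            # responses retained after scoring
    temperature: float = 1.0                     # sampling temperature (see the balancing sketch below)

print(SelfEvolveConfig())
```
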
Problem

Research questions and friction points this paper is trying to address.

Addressing the scarcity of high-quality chain-of-thought data in multimodal reasoning
Exploring critical factors and performance saturation in self-evolving training
Enhancing multimodal reasoning through RL-inspired training methods and balancing mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reframes self-evolving training through the lens of reinforcement learning
Introduces an automatic balancing mechanism to mitigate performance saturation (see the sketch after this list)
Proposes the M-STaR framework for consistent performance gains across model sizes and benchmarks
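
The balancing bullet deserves one more illustration. A plausible reading is that the trainer monitors an exploration signal, such as the gap between Pass@K (any of K samples correct) and greedy accuracy, and nudges the sampling temperature to keep exploration alive. The sketch below implements that heuristic under those assumptions; the paper's actual monitor and update rule may differ.

```python
# A minimal sketch of an automatic balancing heuristic. Assumption: the
# signal is the gap between Pass@K and greedy accuracy; the paper's exact
# monitor and update rule may differ.

def pass_at_k(sampled_results):
    # Fraction of prompts where at least one of the K samples is correct.
    return sum(any(r) for r in sampled_results) / len(sampled_results)

def greedy_acc(greedy_results):
    # Fraction of prompts where the greedy (temperature 0) decode is correct.
    return sum(greedy_results) / len(greedy_results)

def adjust_temperature(temperature, sampled_results, greedy_results,
                       target_gap=0.1, step=0.1, t_min=0.3, t_max=1.3):
    # Raise temperature when sampling stops finding answers the greedy
    # decode misses (saturation); lower it when the gap is already wide.
    gap = pass_at_k(sampled_results) - greedy_acc(greedy_results)
    if gap < target_gap:
        temperature += step   # exploration has stalled: sample more diversely
    else:
        temperature -= step   # enough exploration: tighten toward exploitation
    return min(max(temperature, t_min), t_max)

# Example: sampling rescues 2 of 4 prompts that greedy decoding gets wrong,
# so the gap is large (0.75 - 0.25 = 0.5) and the temperature is lowered.
samples = [[0, 1, 0], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
greedy = [0, 0, 1, 0]
print(adjust_temperature(1.0, samples, greedy))  # 0.9
```
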
👥 Authors

Wei Liu (The Hong Kong University of Science and Technology)
Junlong Li (Shanghai Jiao Tong University)
Xiwen Zhang
Fan Zhou (Shanghai Jiao Tong University)
Yu Cheng (The Chinese University of Hong Kong)
Junxian He (The Hong Kong University of Science and Technology)