MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) perform well on mathematical and logical reasoning but lack the long-chain reflective reasoning capabilities essential for solving complex real-world problems. To address this, we propose MM-HELIX, the first benchmark dedicated to multimodal long-chain reflective reasoning, comprising 1,260 carefully curated samples. We introduce a Step-Elicited Response Generation mechanism and an Adaptive Hybrid Policy Optimization (AHPO) training strategy that dynamically integrates offline supervised learning with online reinforcement learning to mitigate sparse reward signals and catastrophic forgetting. Leveraging a data synthesis engine and the high-quality reflective-trajectory dataset MM-HELIX-100K, our approach jointly enhances iterative thinking and backtracking abilities. Experiments show that our method achieves an 18.6% accuracy gain on MM-HELIX and an average 5.7% improvement across general mathematical and logical reasoning benchmarks, demonstrating strong generalization and cross-task transferability.

📝 Abstract
While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting of 1,260 samples spanning 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality reflective reasoning traces for the instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals, and that Supervised Fine-Tuning leads to catastrophic forgetting, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and to conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6% accuracy improvement on the MM-HELIX benchmark and demonstrates strong generalization with a +5.7% average performance gain on general mathematics and logic tasks. Our work demonstrates that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Addressing multimodal long-chain reflective reasoning limitations in MLLMs
Creating benchmark and dataset for iterative thinking and backtracking tasks
Developing adaptive optimization to overcome sparse rewards and forgetting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Step-Elicited Response Generation for instruction-tuning data
Adaptive Hybrid Policy Optimization unifies supervision and exploration
Holistic platform synthesizes multimodal benchmark for reflective reasoning
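The AHPO idea above can be illustrated with a minimal sketch: blend an offline supervised (expert/SFT) loss with an online RL loss, weighting the offline term heavily while the policy's success rate is low (rewards are sparse) and phasing it out once the policy is proficient. The function names, the linear gating rule, and the `threshold` parameter are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of Adaptive Hybrid Policy Optimization (AHPO).
# Assumption: a linear gate on the policy's recent success rate decides
# how much weight the offline expert loss receives; the paper's exact
# gating rule and losses may differ.

def adaptive_mix(success_rate: float, threshold: float = 0.5) -> float:
    """Weight on the offline (expert) loss: 1.0 when the policy never
    succeeds, decaying linearly to 0.0 at the proficiency threshold."""
    if success_rate >= threshold:
        return 0.0
    return 1.0 - success_rate / threshold

def ahpo_loss(sft_loss: float, rl_loss: float, success_rate: float) -> float:
    """Single-stage objective: dynamically blend offline supervision and
    online policy optimization rather than running two separate stages."""
    w = adaptive_mix(success_rate)
    return w * sft_loss + (1.0 - w) * rl_loss

# With no successes the objective is pure supervision; once the policy
# clears the threshold it is pure online RL.
print(ahpo_loss(2.0, 1.0, 0.0))   # all weight on the expert loss
print(ahpo_loss(2.0, 1.0, 0.5))   # all weight on the RL loss
```

In practice both losses would be tensors backpropagated through the policy; the scalar version above only demonstrates the adaptive scheduling that lets one stage cover both sparse-reward and proficient regimes.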