PROPA: Toward Process-level Optimization in Visual Reasoning via Reinforcement Learning

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-Language Models (VLMs) suffer from error accumulation in multi-step visual reasoning, leading to degraded performance. Existing post-training approaches are constrained either by the high cost of step-level supervision (e.g., supervised fine-tuning) or by sparse, outcome-level feedback (e.g., GRPO), hindering stable optimization of intermediate reasoning steps. To address this, the paper proposes PROPA (Process-level Reasoning Optimization with interleaved Policy Alignment), a framework that integrates Monte Carlo Tree Search (MCTS) with GRPO to generate dense, process-level rewards without human-annotated reasoning traces. PROPA interleaves GRPO updates with SFT to mitigate the cold-start problem, and trains a Process Reward Model (PRM) to guide inference-time search. Extensive experiments across seven benchmarks and four VLM backbones demonstrate consistent improvements over SFT- and RLVR-based baselines: up to 17.0% gains on in-domain tasks and 21.0% on out-of-domain tasks, enhancing both reasoning fidelity and generalization.

📝 Abstract
Despite significant progress, Vision-Language Models (VLMs) still struggle with complex visual reasoning, where multi-step dependencies cause early errors to cascade through the reasoning chain. Existing post-training paradigms are limited: Supervised Fine-Tuning (SFT) relies on costly step-level annotations, while Reinforcement Learning with Verifiable Rewards (RLVR) methods like GRPO provide only sparse, outcome-level feedback, hindering stable optimization. We introduce PROPA (Process-level Reasoning Optimization with interleaved Policy Alignment), a novel framework that integrates Monte Carlo Tree Search (MCTS) with GRPO to generate dense, process-level rewards and optimize reasoning at each intermediate step without human annotations. To overcome the cold-start problem, PROPA interleaves GRPO updates with SFT, enabling the model to learn from both successful and failed reasoning trajectories. A Process Reward Model (PRM) is further trained to guide inference-time search, aligning the test-time search with the training signal. Across seven benchmarks and four VLM backbones, PROPA consistently outperforms both SFT- and RLVR-based baselines. It achieves up to 17.0% gains on in-domain tasks and 21.0% gains on out-of-domain tasks compared to existing state-of-the-art, establishing a strong reasoning and generalization capability for visual reasoning tasks. The code is available at: https://github.com/YanbeiJiang/PROPA.
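The abstract describes estimating dense, process-level rewards from MCTS-style rollouts rather than from human step annotations. As a rough intuition for how rollout-based step credit can work (not the paper's actual algorithm; `rollout_success`, `step_value`, and `process_rewards` are hypothetical toy functions), one can score each intermediate step by how much it changes a Monte Carlo estimate of eventual success:

```python
import random

random.seed(0)

def rollout_success(state):
    """Toy stand-in for completing a reasoning chain from `state` and
    checking the final answer with a verifiable reward.
    Here: success probability drops with each wrong step (encoded as 0)."""
    wrong = state.count(0)
    return random.random() < max(0.0, 0.9 - 0.4 * wrong)

def step_value(prefix, n_rollouts=200):
    """Monte Carlo value of a partial reasoning trace: the fraction of
    rollouts from this prefix that reach a correct final answer."""
    return sum(rollout_success(prefix) for _ in range(n_rollouts)) / n_rollouts

def process_rewards(trace):
    """Dense step-wise rewards: credit each step by the change it causes
    in the Monte Carlo value estimate (value after minus value before)."""
    rewards, prev = [], step_value([])
    for t in range(1, len(trace) + 1):
        v = step_value(trace[:t])
        rewards.append(v - prev)
        prev = v
    return rewards

# A trace with a correct step (1) followed by a wrong step (0):
# the wrong step receives a clearly negative reward.
r = process_rewards([1, 0])
```

This is only a sketch of the general idea of deriving per-step feedback from outcome-checkable rollouts; PROPA's tree search, reward shaping, and PRM training are specified in the paper itself.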
Problem

Research questions and friction points this paper is trying to address.

Optimizing multi-step visual reasoning to prevent error propagation
Providing dense process-level rewards without human annotations
Overcoming the cold-start problem by combining reinforcement learning with supervised fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses MCTS with GRPO for process-level rewards
Interleaves GRPO updates with SFT for cold-start
Trains Process Reward Model to guide inference search