🤖 AI Summary
To address the limited reasoning capabilities of multimodal large language models (MLLMs), this paper proposes a Process Reward Model (PRM)-guided Beam Annealing Search (BAS), a dynamic beam search strategy that adaptively reduces beam width during inference to balance accuracy and efficiency. Its key contributions are: (1) the first PRM-guided BAS mechanism, which deeply integrates a process reward model into the search policy; (2) the construction of PRM-BAS-300k, a large-scale, step-wise supervised dataset comprising 300k samples; and (3) joint optimization of a value loss and a rank loss for PRM training. The method incurs low computational overhead, is architecture-agnostic, and exhibits strong generalizability and plug-and-play compatibility. Extensive experiments on multimodal reasoning benchmarks demonstrate significant performance gains, empirically validating the effectiveness of process-level supervision in enhancing MLLM reasoning.
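The core search loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: the annealing schedule (here linear), the `policy_step` candidate generator, and the `prm_score` interface are all assumptions for the sake of the example.

```python
def prm_bas(question, policy_step, prm_score, max_steps=8,
            beam_init=8, beam_final=1):
    """Hypothetical sketch of PRM-guided Beam Annealing Search.

    policy_step(prefix) -> list of candidate next reasoning steps (strings)
    prm_score(question, prefix) -> scalar step-level reward from the PRM
    """
    beams = [""]  # partial reasoning traces
    for t in range(max_steps):
        # Anneal: shrink the beam width linearly as context accumulates,
        # so early steps explore broadly and later steps commit.
        width = max(beam_final, round(
            beam_init - (beam_init - beam_final) * t / (max_steps - 1)))
        # Expand every surviving trace by each candidate next step.
        candidates = []
        for prefix in beams:
            for step in policy_step(prefix):
                extended = prefix + step
                candidates.append((prm_score(question, extended), extended))
        # Keep only the top-`width` traces ranked by PRM score.
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        beams = [trace for _, trace in candidates[:width]]
    return beams[0]
```

Because the beam narrows to a single trace by the final step, the total number of PRM calls is far smaller than for a fixed-width beam search of the same initial breadth.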
📝 Abstract
Recent work increasingly focuses on improving the reasoning capabilities of Multimodal Large Language Models (MLLMs). Among existing methods, Process Reward Models (PRMs) stand out for offering dense, step-wise supervision to guide intermediate reasoning. However, how to effectively integrate PRMs into search strategies remains an open question. In this paper, we introduce PRM-BAS (PRM-Guided Beam Annealing Search), a lightweight approach for PRM-guided reasoning that dynamically adjusts beam size -- starting with a broader search space and gradually narrowing it as contextual information accumulates, thereby balancing performance and efficiency. We further propose a unified framework for data construction and PRM training. Specifically, we construct the PRM-BAS-300k dataset by selecting 300k questions from existing datasets and performing rollouts at each step to estimate the probability of reaching a correct final answer. The PRM is then trained using a combination of value loss for absolute action quality and rank loss for relative action quality. Extensive experiments on challenging multimodal reasoning benchmarks demonstrate that PRM-BAS significantly improves reasoning performance while maintaining low computational cost. Moreover, it generalizes well across different model scales and architectures, showcasing strong robustness and plug-and-play capability.
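The training objective described above (a value loss for absolute action quality plus a rank loss for relative action quality) can be sketched in miniature. This is an illustrative reconstruction, not the paper's code: the binary cross-entropy form of the value term, the pairwise hinge form of the rank term, and the `margin` parameter are all assumptions.

```python
import math

def prm_loss(scores, labels, margin=0.1):
    """Hypothetical combined PRM objective over one group of candidate steps.

    scores : PRM-predicted values in (0, 1), one per candidate step
    labels : rollout-estimated probabilities of reaching a correct answer
    """
    eps = 1e-7
    # Value loss: binary cross-entropy against the soft rollout labels,
    # anchoring each score to an absolute estimate of step quality.
    value = -sum(
        y * math.log(max(s, eps)) + (1 - y) * math.log(max(1 - s, eps))
        for s, y in zip(scores, labels)
    ) / len(scores)
    # Rank loss: margin hinge on every pair whose labels disagree,
    # enforcing the relative ordering of candidate steps.
    rank, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                rank += max(0.0, margin - (scores[i] - scores[j]))
                pairs += 1
    if pairs:
        rank /= pairs
    return value + rank
```

Combining the two terms lets the PRM serve both roles BAS needs: calibrated absolute values for deciding when a trace is promising, and reliable relative ordering for pruning the beam.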