PRM-BAS: Enhancing Multimodal Reasoning through PRM-guided Beam Annealing Search

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited reasoning capabilities of multimodal large language models (MLLMs), this paper proposes Process Reward Model (PRM)-guided Beam Annealing Search (PRM-BAS), a dynamic beam search strategy that adaptively narrows the beam width during inference to balance accuracy and efficiency. The key contributions are: (1) a PRM-guided beam annealing mechanism that deeply integrates a process reward model into the search policy; (2) PRM-BAS-300k, a large-scale dataset of 300k questions with step-wise supervision derived from rollouts; and (3) joint optimization of a value loss and a ranking loss for PRM training. The method incurs low computational overhead, is architecture-agnostic, and offers strong generalizability with plug-and-play compatibility. Extensive experiments on multimodal reasoning benchmarks show significant performance gains, empirically validating the effectiveness of process-level supervision for MLLM reasoning.

📝 Abstract
Recent work increasingly focuses on improving the reasoning capabilities of Multimodal Large Language Models (MLLMs). Among existing methods, Process Reward Models (PRMs) stand out for offering dense, step-wise supervision to guide intermediate reasoning. However, how to effectively integrate PRMs into search strategies remains an open question. In this paper, we introduce PRM-BAS (PRM-Guided Beam Annealing Search), a lightweight approach for PRM-guided reasoning that dynamically adjusts beam size -- starting with a broader search space and gradually narrowing it as contextual information accumulates, thereby balancing performance and efficiency. We further propose a unified framework for data construction and PRM training. Specifically, we construct the PRM-BAS-300k dataset by selecting 300k questions from existing datasets and performing rollouts at each step to estimate the probability of reaching a correct final answer. The PRM is then trained using a combination of value loss for absolute action quality and rank loss for relative action quality. Extensive experiments on challenging multimodal reasoning benchmarks demonstrate that PRM-BAS significantly improves reasoning performance while maintaining low computational cost. Moreover, it generalizes well across different model scales and architectures, showcasing strong robustness and plug-and-play capability.
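The annealing idea in the abstract — start with a broad beam and shrink it as contextual information accumulates, ranking candidates with the PRM at each step — can be sketched in plain Python. This is a minimal, illustrative sketch: `candidates_fn`, `prm_score_fn`, and the linear annealing schedule are assumptions for illustration, not the paper's exact formulation.

```python
def beam_annealing_search(candidates_fn, prm_score_fn, initial_beam=8,
                          min_beam=1, max_steps=10):
    """Sketch of PRM-guided beam annealing search.

    candidates_fn(partial_chain) -> list of next-step continuations (hypothetical)
    prm_score_fn(partial_chain)  -> process-reward score for a partial chain
    The linear schedule below is an illustrative assumption.
    """
    beams = [[]]  # each beam is a list of reasoning steps taken so far
    for step in range(max_steps):
        # Anneal: shrink the beam width as context accumulates.
        width = max(min_beam, round(
            initial_beam - (initial_beam - min_beam) * step / max(1, max_steps - 1)))
        # Expand every surviving beam by one reasoning step.
        expanded = [b + [c] for b in beams for c in candidates_fn(b)]
        if not expanded:  # no continuations left: chains are complete
            break
        # Keep only the top-`width` chains under the process reward model.
        expanded.sort(key=prm_score_fn, reverse=True)
        beams = expanded[:width]
    return beams[0] if beams else []
```

With a toy step generator and a score that sums step values, the search keeps the highest-scoring chain while the candidate pool it retains shrinks step by step.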
Problem

Research questions and friction points this paper is trying to address.

How to enhance multimodal reasoning with PRM-guided search
How to balance performance and efficiency in beam search
How to train PRMs using value and rank losses
Innovation

Methods, ideas, or system contributions that make the work stand out.

PRM-guided Beam Annealing Search dynamically adjusts beam size
Unified framework for data construction and PRM training
Combines value loss and rank loss for PRM training
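The third innovation — combining a value loss for absolute step quality with a rank loss for relative step quality — can be sketched as a per-sample objective. This is a hedged sketch in plain Python: the MSE/hinge forms, the `alpha` weighting, and all function names are illustrative assumptions, not the paper's exact loss.

```python
def value_loss(pred, target):
    """MSE between the PRM's predicted value and the rollout-estimated
    probability of reaching a correct final answer (assumed formulation)."""
    return (pred - target) ** 2

def rank_loss(score_better, score_worse, margin=0.0):
    """Pairwise hinge-style ranking loss: a step with a higher rollout
    success rate should outscore its sibling (assumed formulation)."""
    return max(0.0, margin - (score_better - score_worse))

def prm_loss(preds, targets, pairs, alpha=0.5):
    """Joint objective: mean value loss over all steps plus rank loss over
    sibling pairs (i, j) where step i empirically beats step j.
    alpha is an illustrative weighting, not the paper's value."""
    v = sum(value_loss(p, t) for p, t in zip(preds, targets)) / len(preds)
    r = sum(rank_loss(preds[i], preds[j]) for i, j in pairs) / max(1, len(pairs))
    return v + alpha * r
```

When predictions already order sibling steps correctly, the rank term vanishes and only the value term contributes; a mis-ordered pair adds a penalty proportional to the score gap.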
👥 Authors
Pengfei Hu — University of Science and Technology of China
Zhenrong Zhang — University of Science and Technology of China
Qikai Chang — University of Science and Technology of China
Shuhang Liu — University of Science and Technology of China
Jie Ma — University of Science and Technology of China
Jun Du — University of Science and Technology of China
Jianshu Zhang — iFLYTEK Research
Quan Liu — iFLYTEK Research
Jianqing Gao — iFLYTEK Research
Feng Ma — iFLYTEK Research
Qingfeng Liu — Professor, Hosei University