🤖 AI Summary
To address the limited reasoning capabilities of multimodal large language models (MLLMs), this paper proposes a Process Reward Model (PRM)-guided Beam Annealing Search (BAS), a dynamic beam search strategy that adaptively reduces beam width during inference to balance accuracy and efficiency. Its key contributions are: (1) the first PRM-guided BAS mechanism, which deeply integrates a process reward model into the search policy; (2) the construction of PRM-BAS-300k, a large-scale, step-wise supervised dataset comprising 300k samples; and (3) joint optimization of a value loss and a rank loss for PRM training. The method incurs low computational overhead, is architecture-agnostic, and exhibits strong generalizability and plug-and-play compatibility. Extensive experiments on multimodal reasoning benchmarks demonstrate significant performance gains, empirically validating the effectiveness of process-level supervision in enhancing MLLM reasoning.
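The core search loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: the annealing schedule (here linear), the `policy_step` candidate generator, and the `prm_score` interface are all assumptions for the sake of the example.

```python
def prm_bas(question, policy_step, prm_score, max_steps=8,
            beam_init=8, beam_final=1):
    """Hypothetical sketch of PRM-guided Beam Annealing Search.

    policy_step(prefix) -> list of candidate next reasoning steps (strings)
    prm_score(question, prefix) -> scalar step-level reward from the PRM
    """
    beams = [""]  # partial reasoning traces
    for t in range(max_steps):
        # Anneal: shrink the beam width linearly as context accumulates,
        # so early steps explore broadly and later steps commit.
        width = max(beam_final, round(
            beam_init - (beam_init - beam_final) * t / (max_steps - 1)))
        # Expand every surviving trace by each candidate next step.
        candidates = []
        for prefix in beams:
            for step in policy_step(prefix):
                extended = prefix + step
                candidates.append((prm_score(question, extended), extended))
        # Keep only the top-`width` traces ranked by PRM score.
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        beams = [trace for _, trace in candidates[:width]]
    return beams[0]
```

Because the beam narrows to a single trace by the final step, the total number of PRM calls is far smaller than for a fixed-width beam search of the same initial breadth.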
📝 Abstract
Recent work increasingly focuses on improving the reasoning capabilities of Multimodal Large Language Models (MLLMs). Among existing methods, Process Reward Models (PRMs) stand out for offering dense, step-wise supervision to guide intermediate reasoning. However, how to effectively integrate PRMs into search strategies remains an open question. In this paper, we introduce PRM-BAS (PRM-Guided Beam Annealing Search), a lightweight approach for PRM-guided reasoning that dynamically adjusts beam size -- starting with a broader search space and gradually narrowing it as contextual information accumulates, thereby balancing performance and efficiency. We further propose a unified framework for data construction and PRM training. Specifically, we construct the PRM-BAS-300k dataset by selecting 300k questions from existing datasets and performing rollouts at each step to estimate the probability of reaching a correct final answer. The PRM is then trained using a combination of value loss for absolute action quality and rank loss for relative action quality. Extensive experiments on challenging multimodal reasoning benchmarks demonstrate that PRM-BAS significantly improves reasoning performance while maintaining low computational cost. Moreover, it generalizes well across different model scales and architectures, showcasing strong robustness and plug-and-play capability.
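The training objective described above (a value loss for absolute action quality plus a rank loss for relative action quality) can be sketched in miniature. This is an illustrative reconstruction, not the paper's code: the binary cross-entropy form of the value term, the pairwise hinge form of the rank term, and the `margin` parameter are all assumptions.

```python
import math

def prm_loss(scores, labels, margin=0.1):
    """Hypothetical combined PRM objective over one group of candidate steps.

    scores : PRM-predicted values in (0, 1), one per candidate step
    labels : rollout-estimated probabilities of reaching a correct answer
    """
    eps = 1e-7
    # Value loss: binary cross-entropy against the soft rollout labels,
    # anchoring each score to an absolute estimate of step quality.
    value = -sum(
        y * math.log(max(s, eps)) + (1 - y) * math.log(max(1 - s, eps))
        for s, y in zip(scores, labels)
    ) / len(scores)
    # Rank loss: margin hinge on every pair whose labels disagree,
    # enforcing the relative ordering of candidate steps.
    rank, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                rank += max(0.0, margin - (scores[i] - scores[j]))
                pairs += 1
    if pairs:
        rank /= pairs
    return value + rank
```

Combining the two terms lets the PRM serve both roles BAS needs: calibrated absolute values for deciding when a trace is promising, and reliable relative ordering for pruning the beam.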