AI Summary
This work addresses the high cost and redundancy of Monte Carlo-based annotations in training Multimodal Process Reward Models (MPRMs). It formally characterizes, for the first time, the information-gradient update mechanism underlying MPRM learning, revealing that effective training hinges on the mixture balance and reliability of positive and negative step labels. Building on this insight, the authors propose Balanced-Information Score (BIS), a data selection strategy that requires no additional annotations. BIS integrates retrospective step-level scoring, explicit modeling of label mixture balance and reliability, and efficient subset selection. Evaluated on VisualProcessBench, BIS achieves full-data performance using only 10% of the training samples and yields a 4.1% relative improvement over random sampling, substantially enhancing data efficiency.
Abstract
Multimodal Process Reward Models (MPRMs) are central to step-level supervision for visual reasoning in MLLMs. Training MPRMs typically requires large-scale Monte Carlo (MC)-annotated corpora, incurring substantial training cost. This paper studies data efficiency in MPRM training. Our preliminary experiments reveal that MPRM training quickly saturates under random subsampling of the training data, indicating substantial redundancy within existing MC-annotated corpora. To explain this, we formalize a theoretical framework and reveal that informative gradient updates depend on two factors: the label mixture of positive/negative steps and label reliability (the average MC score of positive steps). Guided by these insights, we propose the Balanced-Information Score (BIS), which prioritizes both mixture and reliability based on existing MC signals at the rollout level, without incurring any additional annotation cost. Across two backbones (InternVL2.5-8B and Qwen2.5-VL-7B) on VisualProcessBench, BIS-selected subsets consistently match or even surpass full-data performance at small data fractions. Notably, the BIS subset reaches full-data performance using only 10% of the training data, improving over random subsampling by a relative 4.1%.
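The two factors the abstract identifies can be sketched concretely. The following is a minimal illustrative Python sketch, not the paper's exact formulation: the thresholding rule, the entropy-based balance term, and the weight `alpha` are all assumptions introduced here for illustration.

```python
import math

def bis_score(step_mc_scores, threshold=0.5, alpha=0.5):
    """Score one rollout from its existing per-step MC scores (no extra annotation).

    Combines (hypothetically) the two factors named in the abstract:
    - mixture balance of positive/negative step labels, and
    - reliability, i.e. the average MC score of the positive steps.
    """
    labels = [1 if s > threshold else 0 for s in step_mc_scores]
    p = sum(labels) / len(labels)  # fraction of positive steps
    # Mixture balance: binary entropy, maximal when positives and negatives are balanced.
    balance = 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    # Reliability: average MC score over the positive steps.
    pos = [s for s, l in zip(step_mc_scores, labels) if l == 1]
    reliability = sum(pos) / len(pos) if pos else 0.0
    return alpha * balance + (1 - alpha) * reliability

def select_subset(rollouts, fraction=0.10):
    """Rank rollouts by BIS-style score and keep the top fraction."""
    ranked = sorted(rollouts, key=bis_score, reverse=True)
    return ranked[: max(1, int(len(ranked) * fraction))]
```

Under this sketch, a rollout with an even positive/negative split and confident positive steps (e.g. `[0.9, 0.1, 0.8, 0.2]`) scores higher than an all-positive or all-negative rollout, matching the intuition that such mixed, reliable rollouts yield informative gradient updates.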