🤖 AI Summary
This work addresses a critical limitation of current medical large language models that use test-time reinforcement learning (TTRL): majority voting serves as the supervision signal, yet statistical consensus does not necessarily reflect clinically correct reasoning in complex scenarios. To overcome this, the study proposes a unified training paradigm that, for the first time, integrates a Medical Reasoning Process Model (Med-RPM) into both the TTRL and test-time scaling (TTS) frameworks. By replacing majority voting with expert-aligned, fine-grained process rewards, the approach steers models toward clinically valid reasoning trajectories. Experiments on four medical reasoning benchmarks show that the method significantly outperforms existing TTRL approaches and standalone process reward models, confirming the effectiveness and scalability of structured process-based rewards and marking a shift from statistical consensus to clinical correctness.
📝 Abstract
Recent advances in medical large language models have explored Test-Time Reinforcement Learning (TTRL) to enhance reasoning. However, standard TTRL often relies on majority voting (MV) as a heuristic supervision signal, which can be unreliable in complex medical scenarios where the most frequent reasoning path is not necessarily the clinically correct one. In this work, we propose a novel, unified training paradigm that integrates medical process reward models (PRMs) with TTRL to bridge the gap between test-time scaling (TTS) and parametric model optimization. Specifically, we advance the TTRL framework by replacing conventional MV with a fine-grained, expert-aligned supervision paradigm based on Med-RPM. This integration ensures that reinforcement learning is guided by medical correctness rather than mere consensus, effectively distilling search-based intelligence into the model's parametric memory. Extensive evaluations on four benchmarks demonstrate that our method consistently and significantly outperforms standard TTRL and standalone PRM selection. Our findings establish that transitioning from stochastic heuristics to structured, step-wise rewards is essential for developing reliable and scalable medical AI systems.
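The paper's code is not shown here, but the supervision swap the abstract describes can be illustrated with a minimal sketch. The snippet below contrasts a majority-vote pseudo-reward (standard TTRL) with a step-wise process reward over the same sampled rollouts; all function names, the toy step scorer, and the min-aggregation choice are assumptions for illustration only, not Med-RPM's actual scoring rule or the paper's API.

```python
# Illustrative sketch only: how a step-wise process reward can replace
# majority voting (MV) as the TTRL supervision signal. Names such as
# mv_rewards, prm_rewards, and fake_step_scorer are hypothetical.
from collections import Counter
from typing import Callable, List, Tuple

Trace = Tuple[List[str], str]  # (reasoning steps, final answer)

def mv_rewards(traces: List[Trace]) -> List[float]:
    """Standard TTRL signal: reward 1.0 if a trace's answer matches the
    majority answer across rollouts, else 0.0. Consensus can be wrong:
    the most frequent answer is not necessarily clinically correct."""
    majority, _ = Counter(ans for _, ans in traces).most_common(1)[0]
    return [1.0 if ans == majority else 0.0 for _, ans in traces]

def prm_rewards(traces: List[Trace],
                step_scorer: Callable[[str], float]) -> List[float]:
    """Process-reward signal (Med-RPM style): score each intermediate
    reasoning step and aggregate (min is one common choice), so the
    reward reflects trajectory quality, not final-answer agreement."""
    return [min(step_scorer(s) for s in steps) if steps else 0.0
            for steps, _ in traces]

# Toy usage: three rollouts for one test question. Under MV, both traces
# sharing the popular answer "B" get full reward even though one skips a
# safety check; a step scorer can penalize the flawed step instead.
traces = [
    (["rule out contraindication", "check dosage"], "B"),
    (["skip contraindication check"], "B"),
    (["rule out contraindication", "check dosage", "verify interaction"], "A"),
]
fake_step_scorer = lambda step: 0.2 if "skip" in step else 0.9  # stand-in PRM
print("MV rewards :", mv_rewards(traces))                     # [1.0, 1.0, 0.0]
print("PRM rewards:", prm_rewards(traces, fake_step_scorer))  # [0.9, 0.2, 0.9]
# In TTRL, these rewards would then weight a policy-gradient update,
# distilling the process-level signal into the model's parameters.
```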