AI Summary
This work addresses reward design for reinforcement learning in open-ended long-form text generation, where reliable reference answers and automatic evaluation metrics are often unavailable. The authors propose Tournament-GRPO, which runs repeated multi-round tournaments among rollouts for the same query, using a large language model as a judge to compare candidates against a rubric rather than score each one in isolation. The accumulated tournament outcomes are normalized into group-wise rewards for GRPO training, which is intended to mitigate three weaknesses of absolute scores: poor calibration, weak discrimination among same-query rollouts, and saturation during optimization. On Deep Research Bench, the method improves the overall score by 4.52 points over the strongest baseline, and further analyses indicate a favorable effectiveness-efficiency trade-off.
Abstract
Reinforcement learning in open-ended long-form generation is challenging because reliable reference answers and automatic metrics are often unavailable. Existing rubric-based methods typically rely on pointwise LLM-as-a-judge scoring, but absolute scores are difficult to calibrate across complex responses, may provide weak discrimination among same-query rollouts, and can become saturated during optimization. We propose Tournament-GRPO, a group-wise reward framework that converts rubric-guided LLM judgments into relative rewards through repeated multi-round tournaments among same-query rollouts. Tournament-GRPO compares candidates within groups, accumulates tournament outcomes, and normalizes them into group-wise rewards for GRPO training. Experiments on Deep Research Bench show that Tournament-GRPO consistently outperforms existing reward-design baselines, achieving a 4.52-point overall-score improvement over the strongest baseline. Further analyses show that tournament rewards provide a favorable effectiveness-efficiency trade-off and that tournament design affects training dynamics. These results suggest that rubric-guided tournament comparison provides an effective reward signal for reinforcement learning in open-ended long-form generation.
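The compare, accumulate, normalize pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the tournament format, the judge interface, or the normalization formula, so everything below is an assumption made for illustration: a repeated single-elimination knockout with random seeding, a hypothetical pairwise `judge` callable standing in for the rubric-guided LLM judge, match wins as the accumulated outcome, and a z-score over the group as the normalization. It should not be read as the authors' implementation.

```python
import random
import statistics
from typing import Callable, List, Sequence

# Hypothetical judge interface (an assumption, not from the paper):
# returns 0 if the first response is preferred, 1 if the second is.
Judge = Callable[[str, str], int]


def run_knockout(responses: Sequence[str], judge: Judge,
                 wins: List[float], rng: random.Random) -> None:
    """Play one single-elimination tournament and add match wins in place."""
    alive = list(range(len(responses)))
    rng.shuffle(alive)  # random seeding for this tournament
    while len(alive) > 1:
        next_round = []
        if len(alive) % 2 == 1:
            # Odd number of candidates: the last one advances with a bye.
            next_round.append(alive.pop())
        for a, b in zip(alive[0::2], alive[1::2]):
            winner = a if judge(responses[a], responses[b]) == 0 else b
            wins[winner] += 1.0
            next_round.append(winner)
        alive = next_round


def tournament_rewards(responses: Sequence[str], judge: Judge,
                       repeats: int = 3, seed: int = 0,
                       eps: float = 1e-6) -> List[float]:
    """Accumulate wins over repeated tournaments, then z-score within the group."""
    rng = random.Random(seed)
    wins = [0.0] * len(responses)
    for _ in range(repeats):
        run_knockout(responses, judge, wins, rng)
    mean = statistics.fmean(wins)
    std = statistics.pstdev(wins)
    # If every candidate has the same win count, all rewards are zero.
    return [(w - mean) / (std + eps) for w in wins]


def length_judge(first: str, second: str) -> int:
    """Toy stand-in for an LLM judge: prefers the longer response."""
    return 0 if len(first) >= len(second) else 1


if __name__ == "__main__":
    rollouts = ["a", "bbb", "cc", "dddd"]  # same-query rollouts (toy data)
    print(tournament_rewards(rollouts, length_judge))
```

In this sketch the rewards are zero-mean within each group, so they can be used as group-relative signals for same-query rollouts. A real judge would be an LLM call conditioned on the query and rubric, and its cost grows with the number of matches, which is where the choice of tournament format and number of repeats matters.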