AI Summary
This work addresses the high computational cost of sampling-and-reranking-based text generation during inference and its reliance on gold references or human preference data. To this end, the authors propose C-GRPO, a novel method that reformulates the consensus utility from minimum Bayes risk (MBR) decoding as a group-wise relative optimization objective within a policy gradient framework. C-GRPO enables end-to-end, reference-free training using only a utility function and policy-generated samples. Theoretical analysis shows that the gradient of its objective is aligned with the gradient of the expected MBR utility. Experiments on WMT 2024 machine translation and XSum summarization demonstrate that C-GRPO matches the performance of MBR decoding, significantly outperforms other reference-free baselines, and entirely eliminates the need for repeated sampling and scoring at inference time.
Abstract
Many strong decoding methods for text generation follow a sample-and-rerank paradigm: they draw multiple candidates, score each under a utility (reward) function using consensus across samples, and return the best one. Although effective, these methods incur high computational costs during inference due to repeated sampling and scoring. Prior attempts to amortize inference-time computation typically rely on gold references, teacher labels, or curated preference data, increasing dataset construction effort and the demand for high-fidelity reward models. We propose Consensus Group Relative Policy Optimization (C-GRPO), which distills Minimum Bayes Risk (MBR) decoding into training by formulating the consensus utility as a group-relative objective within GRPO. C-GRPO requires only a utility function and policy samples, without gold references or explicit preference labels. Under ideal conditions, we show that the gradient of the C-GRPO objective is directionally aligned with the gradient of the expected-utility objective underlying MBR decoding, leading to a convergence guarantee. Experiments on machine translation (WMT 2024) and text summarization (XSum) demonstrate that C-GRPO achieves performance comparable to MBR decoding without the associated inference-time overhead, while outperforming reference-free baseline methods.
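To make the two ingredients concrete, the sketch below illustrates (1) MBR decoding's consensus utility, where each candidate is scored by its average utility against the other samples, and (2) a GRPO-style group-relative advantage, which standardizes rewards within a sampled group so no reference or learned baseline is needed. This is a minimal illustration under assumed simplifications, not the paper's implementation; the `utility` function, candidate strings, and helper names are all hypothetical placeholders.

```python
import statistics

def mbr_select(candidates, utility):
    """MBR decoding: return the candidate with the highest consensus
    utility, i.e. the mean utility against all other sampled candidates."""
    def consensus(c):
        others = [o for o in candidates if o is not c]
        return sum(utility(c, o) for o in others) / len(others)
    return max(candidates, key=consensus)

def group_relative_advantages(rewards):
    """GRPO-style advantage: standardize each sample's reward within its
    group (subtract the group mean, divide by the group std)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# Toy utility (hypothetical): character-set overlap between two strings.
def toy_utility(a, b):
    return len(set(a) & set(b))
```

C-GRPO's reference-free recipe, as described above, amounts to using each sample's consensus utility against the rest of its group as the reward fed into the group-relative advantage, so the policy is optimized toward the same expected-utility objective that MBR decoding evaluates at inference time.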