Consensus Group Relative Policy Optimization for Text Generation

📅 2026-02-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the high computational cost of sampling-and-reranking-based text generation during inference and its reliance on gold references or human preference data. To this end, the authors propose C-GRPO, a novel method that reformulates the consensus utility from minimum Bayes risk (MBR) decoding as a group-wise relative optimization objective within a policy gradient framework. C-GRPO enables end-to-end, reference-free training using only a utility function and policy-generated samples. Theoretical analysis shows that the gradient of its objective aligns with the direction of expected MBR utility. Experiments on WMT 2024 machine translation and XSum summarization demonstrate that C-GRPO matches the performance of MBR decoding, significantly outperforms other reference-free baselines, and entirely eliminates the need for repeated sampling and scoring at inference time.

๐Ÿ“ Abstract
Many strong decoding methods for text generation follow a sample-and-rerank paradigm: they draw multiple candidates, score each under a utility (reward) function using consensus across samples, and return the best one. Although effective, these methods incur high computational costs during inference due to repeated sampling and scoring. Prior attempts to amortize inference-time computation typically rely on gold references, teacher labels, or curated preference data, increasing dataset construction effort and the demand for high-fidelity reward models. We propose Consensus Group Relative Policy Optimization (C-GRPO), which distills Minimum Bayes Risk (MBR) decoding into training by formulating the consensus utility as a group-relative objective within GRPO. C-GRPO requires only a utility function and policy samples, without gold references or explicit preference labels. Under ideal conditions, we show that the gradient of the C-GRPO objective is directionally aligned with the gradient of the expected-utility objective underlying MBR decoding, leading to a convergence guarantee. Experiments on machine translation (WMT 2024) and text summarization (XSum) demonstrate that C-GRPO achieves performance comparable to MBR decoding without the associated inference-time overhead, while outperforming reference-free baseline methods.
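As a rough illustration of the idea in the abstract (not the paper's own code), the group-relative consensus scoring can be sketched as follows: each sampled candidate receives an MBR-style score equal to its mean utility against the other samples in the group, and scores are standardized within the group to form GRPO-style advantages. The function name and the toy utility below are hypothetical.

```python
import math

def consensus_advantages(candidates, utility):
    """Group-relative advantages from a pairwise consensus utility.

    candidates: list of sampled outputs from the policy for one input.
    utility: symmetric-or-not function u(hypothesis, pseudo_reference).

    Each candidate y_i gets an MBR-style consensus score: the mean of
    u(y_i, y_j) over all other samples y_j in the group. Scores are then
    standardized within the group, as in GRPO's advantage estimate.
    """
    n = len(candidates)
    scores = [
        sum(utility(candidates[i], candidates[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    mean = sum(scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    # Small epsilon guards against a degenerate group where all scores tie.
    return [(s - mean) / (std + 1e-8) for s in scores]
```

These advantages would then weight the per-sample policy-gradient term, so candidates that agree with the group consensus are reinforced without any gold reference.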
Problem

Research questions and friction points this paper is trying to address.

text generation
decoding
computational cost
reference-free
consensus scoring
Innovation

Methods, ideas, or system contributions that make the work stand out.

Consensus Group Relative Policy Optimization
Minimum Bayes Risk decoding
reference-free training
relative policy optimization
text generation