🤖 AI Summary
To address the tendency of large language models (LLMs) to converge to trivial solutions when enhancing reasoning capabilities in unsupervised settings, this paper proposes Co-Reward, a self-supervised reinforcement learning framework grounded in contrastive consistency. The method constructs cross-validated reward signals from semantically similar analogical question pairs, generates surrogate labels via rollout-based voting, and explicitly enforces semantic consistency across reasoning paths, thereby eliminating reliance on human annotations. This design effectively mitigates reward collapse during self-reward training and substantially improves reasoning stability. On the MATH500 benchmark, Llama-3.2-3B-Instruct achieves a 6.8% absolute improvement over training with ground-truth-labeled rewards, and Co-Reward consistently outperforms existing self-rewarding approaches across multiple benchmarks.
📝 Abstract
Although reinforcement learning with verifiable rewards (RLVR) shows promise in improving the reasoning ability of large language models (LLMs), scaling it up remains difficult due to the reliance on human-annotated labels, especially for complex tasks. Recent alternatives that explore various self-reward signals demonstrate the potential to elicit LLM reasoning, but suffer from a non-negligible collapse issue. Inspired by the success of self-supervised learning, we propose *Co-Reward*, a novel RL framework that leverages contrastive agreement across semantically analogical questions as a reward basis. Specifically, we construct a similar question for each training sample (without labels) and synthesize their individual surrogate labels through simple rollout voting; the reward is then constructed by cross-referencing the labels of each question pair to enforce internal reasoning consistency across analogical inputs. Intuitively, such a self-supervised reward-shaping mechanism makes it harder for learning to collapse into a trivial solution, and promotes stable reasoning elicitation and improvement by expanding the input sample variants. Empirically, Co-Reward achieves superior performance compared to other self-reward baselines on multiple reasoning benchmarks and LLM series, and reaches or even surpasses ground-truth (GT) labeled reward, with improvements of up to +6.8% on MATH500 over GT reward on Llama-3.2-3B-Instruct. Our code is publicly available at https://github.com/tmlr-group/Co-Reward.
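The cross-referenced reward construction described above can be sketched as follows. This is a minimal, hypothetical simplification (the function names `majority_vote` and `co_reward` are illustrative, not from the paper's codebase): each rollout answer to a question is rewarded for agreeing with the rollout-voted surrogate label of its *analogical* counterpart, and vice versa.

```python
from collections import Counter

def majority_vote(answers):
    """Surrogate label: the most frequent final answer among rollouts
    (ties broken arbitrarily by Counter ordering)."""
    return Counter(answers).most_common(1)[0][0]

def co_reward(orig_rollouts, analog_rollouts):
    """Cross-referenced rewards for an (original, analogical) question pair.

    Rollouts on the original question are scored against the voted label
    of the analogical question, and vice versa, so a policy cannot be
    rewarded for collapsing to a single question-specific trivial answer.
    """
    label_orig = majority_vote(orig_rollouts)
    label_analog = majority_vote(analog_rollouts)
    rewards_orig = [1.0 if a == label_analog else 0.0 for a in orig_rollouts]
    rewards_analog = [1.0 if a == label_orig else 0.0 for a in analog_rollouts]
    return rewards_orig, rewards_analog

# Example: three rollouts per question; "42" wins the vote on both sides.
r_orig, r_analog = co_reward(["42", "42", "41"], ["42", "40", "42"])
print(r_orig)    # [1.0, 1.0, 0.0]
print(r_analog)  # [1.0, 0.0, 1.0]
```

In the actual framework the rewards would feed into an RL objective (e.g. a policy-gradient update); this sketch only illustrates how pairing analogical questions makes a trivial constant answer unlikely to match both voted labels at once.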