Wisdom of the Crowd: Reinforcement Learning from Coevolutionary Collective Feedback

📅 2025-08-17
🤖 AI Summary
Existing RL approaches for enhancing LLM reasoning rely on costly human annotations or sophisticated reward models; self-feedback methods, in contrast, suffer from overconfidence, reward hacking, and training instability due to the inherent limitations of single-model supervision. This paper proposes a multi-LLM collaborative evolutionary RL framework that operates without external supervision: leveraging output diversity across multiple LLMs, it uses collective consensus as an intrinsic reward signal and incorporates a self-consistency-weighted voting mechanism to enable population-level co-evolution. By breaking the bottleneck of single-model feedback, the method mitigates reward bias and improves training stability. Evaluated on four open-source LLMs and four mathematical reasoning benchmarks, it achieves an average relative accuracy gain of 16.72% and boosts majority-vote accuracy by 4.51%, extending the frontier of unsupervised autonomous learning.

📝 Abstract
Reinforcement learning (RL) has significantly enhanced the reasoning capabilities of large language models (LLMs), but its reliance on expensive human-labeled data or complex reward models severely limits scalability. While existing self-feedback methods aim to address this problem, they are constrained by the capabilities of a single model, which can lead to overconfidence in incorrect answers, reward hacking, and even training collapse. To this end, we propose Reinforcement Learning from Coevolutionary Collective Feedback (RLCCF), a novel RL framework that enables multi-model collaborative evolution without external supervision. Specifically, RLCCF optimizes the ability of a model collective by maximizing its Collective Consistency (CC), which jointly trains a diverse ensemble of LLMs and provides reward signals by voting on collective outputs. Moreover, each model's vote is weighted by its Self-Consistency (SC) score, ensuring that more confident models contribute more to the collective decision. Benefiting from the diverse output distributions and complementary abilities of multiple LLMs, RLCCF enables the model collective to continuously enhance its reasoning ability through coevolution. Experiments on four mainstream open-source LLMs across four mathematical reasoning benchmarks demonstrate that our framework yields significant performance gains, achieving an average relative improvement of 16.72% in accuracy. Notably, RLCCF not only improves the performance of individual models but also enhances the group's majority-voting accuracy by 4.51%, demonstrating its ability to extend the collective capability boundary of the model collective.
Problem

Research questions and friction points this paper is trying to address.

Reducing reliance on costly human-labeled data in RL for LLMs
Overcoming single-model limitations like overconfidence and reward hacking
Enhancing multi-model collaboration without external supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning from Coevolutionary Collective Feedback
Maximizing Collective Consistency for reward signals
Weighting votes by Self-Consistency scores
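
The reward mechanism summarized above can be sketched in a few lines: each model samples several answers, its Self-Consistency (SC) score is the fraction of its samples agreeing with its own majority answer, and the collective answer is chosen by SC-weighted voting. The function names, the binary reward shaping, and the toy data below are illustrative assumptions, not the paper's exact implementation.

```python
from collections import Counter

def self_consistency(samples):
    """Return a model's majority answer and its SC score
    (fraction of the model's samples that match that answer)."""
    answer, freq = Counter(samples).most_common(1)[0]
    return answer, freq / len(samples)

def collective_reward(model_samples):
    """SC-weighted voting across models: each model's majority answer
    votes with weight equal to its SC score. A model earns reward 1.0
    if its answer matches the collective consensus, else 0.0."""
    votes = {}
    per_model = []
    for samples in model_samples:
        ans, sc = self_consistency(samples)
        per_model.append(ans)
        votes[ans] = votes.get(ans, 0.0) + sc
    consensus = max(votes, key=votes.get)
    rewards = [1.0 if ans == consensus else 0.0 for ans in per_model]
    return consensus, rewards

# Toy example: three models, each sampling four answers to one question.
samples = [
    ["42", "42", "41", "42"],  # consistent model, SC = 0.75
    ["41", "41", "42", "40"],  # less consistent model, SC = 0.50
    ["42", "40", "42", "42"],  # consistent model, SC = 0.75
]
consensus, rewards = collective_reward(samples)
print(consensus, rewards)  # → 42 [1.0, 0.0, 1.0]
```

In this sketch a confident minority cannot be outvoted by several low-consistency models unless their combined SC weight exceeds it, which is the intuition behind weighting votes by confidence.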