The Majority is not always right: RL training for solution aggregation

📅 2025-09-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Current approaches to improving LLM reasoning generate multiple candidate solutions and combine them with static aggregation strategies such as majority voting or reward-model ranking. These strategies yield limited gains and can discard correct answers that appear only in a minority of candidates. Method: We propose AggLM, which frames multi-solution aggregation as a learnable reasoning skill. An aggregator model is trained with reinforcement learning from verifiable rewards to review, reconcile, and synthesize a final answer from a set of candidate solutions. Training balances easy examples (the majority answer is already correct) against hard examples (only a minority of candidates is correct), so the model learns both cases. Contribution/Results: AggLM outperforms strong rule-based and reward-model baselines across multiple reasoning benchmarks, while requiring substantially fewer tokens than majority voting over larger numbers of solutions. It also generalizes to solutions produced by other models, including models stronger than those seen in the training data, which suggests that aggregation is a transferable reasoning capability.

📝 Abstract

Scaling up test-time compute, by generating multiple independent solutions and selecting or aggregating among them, has become a central paradigm for improving large language models (LLMs) on challenging reasoning tasks. While most prior work relies on simple majority voting or reward model ranking to aggregate solutions, these approaches may yield only limited benefits. In this work, we propose to learn aggregation as an explicit reasoning skill: given a set of candidate solutions, we train an aggregator model to review, reconcile, and synthesize a final, correct answer using reinforcement learning from verifiable rewards. A key ingredient is careful balancing of easy and hard training examples, allowing the model to learn both to recover minority-but-correct answers and to handle easy majority-correct cases. Empirically, we find our method, AggLM, outperforms both strong rule-based and reward-model baselines across multiple benchmarks. Furthermore, it generalizes effectively to solutions from differing models, including stronger ones than those contained in the training data, all while requiring substantially fewer tokens than majority voting with larger numbers of solutions.
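The setup described in the abstract (candidate solutions in, one final answer out, scored by a verifiable reward) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt template, the function names `build_aggregation_prompt` and `verifiable_reward`, and the exact-match normalization are all assumptions made for this example.

```python
from typing import List


def build_aggregation_prompt(problem: str, candidates: List[str]) -> str:
    """Format a problem and its candidate solutions as one aggregator input.

    The template is hypothetical; the paper's actual prompt is not given here.
    """
    parts = [f"Problem:\n{problem}", ""]
    for i, solution in enumerate(candidates, start=1):
        parts.append(f"Candidate solution {i}:\n{solution}")
        parts.append("")
    parts.append(
        "Review the candidates, reconcile disagreements, and give one final answer."
    )
    return "\n".join(parts)


def normalize(answer: str) -> str:
    """Crude normalization (assumed): trim, lowercase, drop inner spaces."""
    return answer.strip().lower().replace(" ", "")


def verifiable_reward(final_answer: str, reference: str) -> float:
    """Binary reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if normalize(final_answer) == normalize(reference) else 0.0
```

In an RL loop, the aggregator's output for each prompt would be scored with a reward of this kind. A real system would likely need a more robust answer checker (for example, symbolic equivalence for math answers) than the string match assumed here.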

Problem

Research questions and friction points this paper is trying to address.

Improving solution aggregation in reasoning tasks for LLMs

Learning to synthesize correct answers from multiple candidates

Overcoming limitations of simple majority voting approaches
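To make the limitation of majority voting concrete, here is a small sketch of the baseline on made-up data. The answers and the reference value are invented for illustration; only the voting rule itself is the standard technique.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


# Made-up final answers extracted from five sampled solutions.
answers = ["12", "12", "15", "12", "15"]
reference = "15"  # made-up ground truth

picked = majority_vote(answers)
correct_answer_present = reference in answers
majority_is_correct = picked == reference
```

Here the correct answer appears among the candidates, yet voting returns the more frequent wrong one. This is the minority-but-correct case that a learned aggregator is meant to recover.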

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses reinforcement learning for solution aggregation

Balances easy and hard training examples

Generalizes to solutions from different models
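One way the easy/hard balancing could look in practice is sketched below. It assumes that an example counts as "easy" when the majority answer among its candidates is already correct and "hard" otherwise, and that balancing means keeping all hard examples plus a random fraction of easy ones. The `easy_keep_fraction` parameter, the dictionary layout, and the function names are hypothetical choices for this illustration, not details confirmed by the text above.

```python
import random
from collections import Counter
from typing import Dict, List


def is_easy(candidate_answers: List[str], reference: str) -> bool:
    """Assumed definition: easy means the majority answer is already correct."""
    majority = Counter(candidate_answers).most_common(1)[0][0]
    return majority == reference


def balance_examples(
    examples: List[Dict], easy_keep_fraction: float, seed: int = 0
) -> List[Dict]:
    """Keep every hard example and a random fraction of the easy ones."""
    rng = random.Random(seed)
    easy = [ex for ex in examples if is_easy(ex["answers"], ex["reference"])]
    hard = [ex for ex in examples if not is_easy(ex["answers"], ex["reference"])]
    n_keep = int(len(easy) * easy_keep_fraction)  # rounds down
    mixed = hard + rng.sample(easy, n_keep)
    rng.shuffle(mixed)
    return mixed
```

With too few easy examples the aggregator might learn to distrust the majority even when it is right; with too many, the hard minority-correct cases could be drowned out. The fraction is therefore a tuning knob in this sketch.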
