Debate as Reward: A Multi-Agent Reward System for Scientific Ideation via RL Post-Training

📅 2026-04-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets three problems that arise when large language models are used for scientific ideation: hallucination and computational inefficiency in approaches built on iterative prompting or complex multi-agent architectures, and reward hacking when reinforcement learning is driven by imperfect evaluation proxies. The paper proposes an RL post-training framework whose reward comes from a multi-agent debate acting as a judge, which the authors describe as the first multi-agent reward function for this task. The judge separates methodological validation from implementation details and returns a strict binary reward, which is intended to make the signal robust to reward hacking. Because that signal is sparse, the policy is optimized with an unbiased variant of Group Relative Policy Optimization that mitigates artificial length bias. Training is grounded in ICLR-320, a curated dataset of problem-solution pairs extracted from ICLR 2024 proceedings, and the authors report that the framework significantly outperforms state-of-the-art baselines on expert-evaluated novelty, feasibility, and effectiveness.
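
To make the idea of a debate-based binary reward concrete, here is a minimal sketch. It is not the paper's implementation: the `Judge` signature, the number of rounds, the unanimous-vote aggregation rule, and the two toy judges are all assumptions chosen for illustration. In a real system each judge would be an LLM call.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface: (problem, idea, transcript so far) -> (accept?, argument).
Judge = Callable[[str, str, Sequence[str]], Tuple[bool, str]]


def debate_reward(problem: str, idea: str, judges: Sequence[Judge], rounds: int = 2) -> float:
    """Run a fixed number of debate rounds and return a strict 0/1 reward."""
    transcript: list = []
    votes: list = []
    for _ in range(rounds):
        votes = []  # only the final round's votes decide the reward
        for judge in judges:
            # Each judge sees a snapshot of every argument made so far.
            accept, argument = judge(problem, idea, tuple(transcript))
            transcript.append(argument)
            votes.append(accept)
    # Assumed aggregation rule: reward 1.0 only if every judge accepts.
    return 1.0 if votes and all(votes) else 0.0


# Toy stand-ins for LLM judges (purely illustrative heuristics).
def novelty_judge(problem, idea, transcript):
    ok = "baseline" not in idea
    return ok, "novelty: " + ("accept" if ok else "reject")


def feasibility_judge(problem, idea, transcript):
    ok = len(idea.split()) >= 3
    return ok, "feasibility: " + ("accept" if ok else "reject")


judges = [novelty_judge, feasibility_judge]
accepted = debate_reward("sparse rewards", "a new contrastive objective", judges)
rejected = debate_reward("sparse rewards", "reuse the baseline as is", judges)
```

Because the output is all-or-nothing, a response cannot collect partial credit by satisfying only the easiest criterion, which is one plausible reading of why a strict binary signal is harder to game than a graded score.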

📝 Abstract

Large Language Models (LLMs) have demonstrated potential in automating scientific ideation, yet current approaches relying on iterative prompting or complex multi-agent architectures often suffer from hallucination or computational inefficiency. A critical bottleneck in applying Reinforcement Learning (RL) to this open-ended domain is reward hacking, where models exploit imperfect evaluation proxies to maximize scores without producing genuine scientific innovation. To address these limitations, we propose an RL framework explicitly tailored for high-quality scientific idea generation. We introduce the first multi-agent reward function designed to serve as a judge, decoupling methodological validation from implementation details while providing strict binary rewards that are robust to reward hacking. To effectively optimize against this sparse signal, we utilize an unbiased variant of Group Relative Policy Optimization to mitigate artificial length bias. We ground our training in ICLR-320, a curated dataset of problem-solution pairs extracted from ICLR 2024 proceedings. Experiments demonstrate that our framework significantly outperforms state-of-the-art baselines across expert-evaluated metrics of novelty, feasibility, and effectiveness.
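
The abstract does not spell out which unbiased variant of Group Relative Policy Optimization is used, so the following is only a sketch of where length bias can come from and one way to remove it. It assumes binary rewards, mean-centred group advantages, and a constant normalizer in place of per-response length normalization (in the spirit of the Dr. GRPO formulation). The function names and the `max_len` parameter are hypothetical, and the importance ratio and clipping terms are omitted.

```python
def group_relative_advantages(rewards):
    """Centre each reward on the mean of its sampling group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def length_normalized_token_weights(advantages, lengths):
    """Per-token gradient weight when each response's loss is averaged over its own length."""
    group = len(advantages)
    return [a / (n * group) for a, n in zip(advantages, lengths)]


def constant_normalized_token_weights(advantages, max_len):
    """Per-token gradient weight when every response is divided by the same constant."""
    group = len(advantages)
    return [a / (max_len * group) for a in advantages]


# One accepted idea and three rejected ones; the third response is ten times longer.
rewards = [1.0, 0.0, 0.0, 0.0]
lengths = [10, 10, 100, 10]

advantages = group_relative_advantages(rewards)
biased = length_normalized_token_weights(advantages, lengths)
unbiased = constant_normalized_token_weights(advantages, max_len=100)
```

With length normalization, each token of the long rejected response (index 2) receives a smaller penalty than each token of the short rejected response (index 1), so padding a wrong answer is mildly rewarded. With the constant normalizer, both receive the same per-token penalty.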

Problem

Research questions and friction points this paper is trying to address.

reward hacking

scientific ideation

reinforcement learning

large language models

multi-agent reward

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent reward

reward hacking

scientific ideation