DARL: Encouraging Diverse Answers for General Reasoning without Verifiers

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing general-purpose reinforcement learning (RL) methods for reasoning are prone to overfitting to reference answers and struggle to generate diverse yet semantically aligned outputs in open-ended tasks. To address this limitation, this work proposes DARL, a framework that explicitly promotes output diversity within standard RL without requiring an additional verifier. DARL introduces a controllable deviation mechanism that encourages plausible variations on the reference answer while preserving semantic consistency with it. The framework is compatible with existing RL approaches and consistently improves performance across 13 benchmarks, achieving average gains over RLPR of 1.3 points on six reasoning benchmarks and 9.5 points on seven general-purpose benchmarks, thereby overcoming the reliance on a single canonical answer inherent in conventional methods.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated promising gains in enhancing the reasoning capabilities of large language models. However, its dependence on domain-specific verifiers significantly restricts its applicability to open and general domains. Recent efforts such as RLPR have extended RLVR to general domains, enabling training on broader datasets and achieving improvements over RLVR. However, a notable limitation of these methods is their tendency to overfit to reference answers, which constrains the model's ability to generate diverse outputs. This limitation is particularly pronounced in open-ended tasks such as writing, where multiple plausible answers exist. To address this, we propose DARL, a simple yet effective reinforcement learning framework that encourages the generation of diverse answers within a controlled deviation range from the reference while preserving alignment with it. Our framework is fully compatible with existing general reinforcement learning methods and can be seamlessly integrated without additional verifiers. Extensive experiments on thirteen benchmarks demonstrate consistent improvements in reasoning performance. Notably, DARL surpasses RLPR, achieving average gains of 1.3 points on six reasoning benchmarks and 9.5 points on seven general benchmarks, highlighting its effectiveness in improving both reasoning accuracy and output diversity.
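The abstract describes the mechanism only at a high level: answers are rewarded for deviating from the reference within a controlled range while remaining semantically aligned with it, with no verifier in the loop. As a rough illustration, the sketch below shows one way such a reward could be shaped, assuming group-wise sampling of several answers per prompt; darl_reward, deviation_band, diversity_weight, and the token-overlap similarity are all placeholder assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch only: names and the token-overlap similarity are
# illustrative stand-ins, not DARL's actual reward.

def similarity(a: str, b: str) -> float:
    """Jaccard word overlap -- a crude stand-in for semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def darl_reward(samples: list[str], reference: str,
                deviation_band: tuple[float, float] = (0.4, 0.9),
                diversity_weight: float = 0.3) -> list[float]:
    """Reward answers that stay close to the reference while differing
    from their sibling samples in the same rollout group."""
    low, high = deviation_band
    rewards = []
    for i, ans in enumerate(samples):
        # Alignment gate: reward only answers whose similarity to the
        # reference falls inside the allowed deviation band.
        aligned = 1.0 if low <= similarity(ans, reference) <= high else 0.0
        # Diversity bonus: mean dissimilarity to the other group samples.
        others = [s for j, s in enumerate(samples) if j != i]
        diversity = (sum(1.0 - similarity(ans, o) for o in others) / len(others)
                     if others else 0.0)
        rewards.append(aligned + diversity_weight * diversity)
    return rewards
```

In this sketch the deviation band acts as a hard gate for clarity; a soft, shaped penalty would likely be closer to a practical implementation, but the gate makes the alignment/diversity trade-off explicit.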
Problem

Research questions and friction points this paper is trying to address.

answer diversity
overfitting to reference answers
open-ended reasoning
general reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

DARL
reinforcement learning
answer diversity
general reasoning
reference alignment
Chongxuan Huang
Xiamen University
RL, Multilingual, Machine Translation, LLMs
Lei Lin
Kuaishou Technology, Beijing, China
Xiaodong Shi
Xiamen University
natural language processing
Wenping Hu
Kuaishou Technology, Beijing, China
Ruiming Tang
Kuaishou Technology, Beijing, China