Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks

📅 2025-06-16
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the absence of verifiable external reward signals in open-ended, long-form reasoning tasks, this paper proposes DRO, a self-driven reinforcement learning framework. Its core innovation is the “Reasoning Reflection Reward” (R3), a fine-grained, human-label-free reward signal generated dynamically by the model itself during chain-of-thought (CoT) reasoning, aligning intermediate reasoning steps with the final outcome. DRO establishes a fully self-contained training paradigm that integrates R3-guided reinforcement learning, self-reflective reward modeling, and dynamic data filtering based on R3 confidence. Evaluated on ParaRev (paragraph revision) and FinQA (math-oriented question answering), DRO consistently outperforms strong baselines while generalizing robustly across both open-ended and structured reasoning tasks. To our knowledge, DRO is the first unsupervised, fine-grained, process-aware reasoning-optimization framework.
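The token-level mechanism behind R3 can be made concrete. Below is a minimal sketch of one way such a reward could be computed with a Hugging Face-style causal LM: it contrasts the reference tokens' log-probabilities with and without the model's own reasoning in context, then rewards the tokens most influenced by that reasoning. The function name, the with/without contrast, and the top-fraction selection rule are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def reasoning_reflection_reward(model, tokenizer, prompt, reasoning, reference,
                                top_frac=0.2):
    """Sketch of an R3-style reward (names and selection rule are illustrative).

    Scores the reference tokens twice: conditioned on the model's own
    chain-of-thought, and without it. Tokens whose log-probability rises most
    when the reasoning is present are treated as reasoning-reflective, and
    the reward averages the conditioned log-probs over that subset.
    """
    ref_ids = tokenizer(reference, return_tensors="pt",
                        add_special_tokens=False).input_ids

    def ref_token_logprobs(context):
        ctx_ids = tokenizer(context, return_tensors="pt").input_ids
        ids = torch.cat([ctx_ids, ref_ids], dim=-1)
        with torch.no_grad():
            logits = model(ids).logits
        # Log-prob of each reference token given all preceding tokens.
        ref_logits = logits[0, ctx_ids.size(-1) - 1 : -1, :]
        logps = F.log_softmax(ref_logits, dim=-1)
        return logps.gather(1, ref_ids[0].unsqueeze(-1)).squeeze(-1)

    lp_with = ref_token_logprobs(prompt + reasoning)
    lp_without = ref_token_logprobs(prompt)

    # Reference tokens most influenced by the reasoning carry the signal.
    influence = lp_with - lp_without
    k = max(1, int(top_frac * influence.numel()))
    key_idx = influence.topk(k).indices
    return lp_with[key_idx].mean().item()
```

Because the scoring model is the policy being trained, this setup needs no external reward model, which is what makes the training loop self-contained.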

๐Ÿ“ Abstract
Recent advances in Large Language Models (LLMs) have showcased impressive reasoning abilities in structured tasks like mathematics and programming, largely driven by Reinforcement Learning with Verifiable Rewards (RLVR), which uses outcome-based signals that are scalable, effective, and robust against reward hacking. However, applying similar techniques to open-ended long-form reasoning tasks remains challenging due to the absence of generic, verifiable reward signals. To address this, we propose Direct Reasoning Optimization (DRO), a reinforcement learning framework for fine-tuning LLMs on open-ended, particularly long-form, reasoning tasks, guided by a new reward signal: the Reasoning Reflection Reward (R3). At its core, R3 selectively identifies and emphasizes key tokens in the reference outcome that reflect the influence of the model's preceding chain-of-thought reasoning, thereby capturing the consistency between reasoning and reference outcome at a fine-grained level. Crucially, R3 is computed internally using the same model being optimized, enabling a fully self-contained training setup. Additionally, we introduce a dynamic data filtering strategy based on R3 for open-ended reasoning tasks, reducing cost while improving downstream performance. We evaluate DRO on two diverse datasets -- ParaRev, a long-form paragraph revision task, and FinQA, a math-oriented QA benchmark -- and show that it consistently outperforms strong baselines while remaining broadly applicable across both open-ended and structured domains.
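The abstract's dynamic data filtering step can likewise be sketched. One plausible reading, assumed here rather than taken from the paper, keeps only prompts whose sampled rollouts receive meaningfully spread-out R3 scores, since near-identical scores carry little policy-gradient signal; the standard-deviation criterion and threshold are illustrative.

```python
import statistics

def r3_dynamic_filter(prompts, r3_scores_per_prompt, min_std=0.05):
    """Sketch of R3-based dynamic data filtering (strategy assumed, not the paper's).

    Keeps only prompts whose sampled rollouts receive meaningfully different
    R3 scores; when every rollout scores about the same, the prompt offers
    little policy-gradient signal and is dropped for the current round.
    """
    kept = []
    for prompt, scores in zip(prompts, r3_scores_per_prompt):
        if len(scores) >= 2 and statistics.stdev(scores) >= min_std:
            kept.append(prompt)
    return kept
```

Re-running such a filter each training round is what makes it dynamic: as the policy improves, previously uninformative prompts can re-enter the training batch.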
Problem

Research questions and friction points this paper is trying to address.

Lack of verifiable rewards for open-ended reasoning tasks
Difficulty in fine-tuning LLMs for long-form reasoning
Need for self-contained optimization in reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

DRO, an RL framework for fine-tuning LLMs on open-ended reasoning tasks
R3, a token-level reward signal that captures reasoning-outcome consistency
R3-based dynamic data filtering that reduces training cost while improving performance