Characterizing, Evaluating, and Optimizing Complex Reasoning

📅 2026-02-09
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the lack of a unified definition, reliable evaluation, and effective optimization mechanism for complex reasoning quality in existing methods. To this end, the authors propose the ME² principle to formally characterize reasoning quality, model reasoning trajectories as directed acyclic graphs (DAGs), and introduce a DAG-based pairwise evaluation framework. Building on this foundation, they construct TRM-Preference, the first preference dataset tailored to complex reasoning, and train a Thinking Reward Model (TRM) to enable scalable assessment and optimization of reasoning quality. Experiments show that the approach improves reasoning selection at test time by up to 19.3% and yields up to a 3.9% gain in reasoning performance during reinforcement learning training, substantially enhancing the model's multi-task reasoning capabilities.
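
To make the DAG view concrete, here is a minimal sketch of a reasoning trace modeled as a directed acyclic graph, with a toy pairwise comparison. The names (`ReasoningDAG`, `useful_fraction`, `pairwise_prefer`) and the "fraction of steps feeding the answer" proxy are illustrative assumptions, not the paper's actual ME²-based construction.

```python
# A minimal, assumed sketch: nodes are reasoning steps, edges are
# "step B depends on step A" links. The paper's real evaluation is richer.
from dataclasses import dataclass, field


@dataclass
class ReasoningDAG:
    steps: dict[int, str] = field(default_factory=dict)       # node id -> step text
    edges: set[tuple[int, int]] = field(default_factory=set)  # (parent, child) links

    def add_step(self, node_id: int, text: str, depends_on: tuple[int, ...] = ()) -> None:
        self.steps[node_id] = text
        for parent in depends_on:
            self.edges.add((parent, node_id))

    def useful_fraction(self, answer_nodes: set[int]) -> float:
        """Fraction of steps lying on some path into an answer node
        (a crude macro-level efficiency proxy)."""
        parents: dict[int, set[int]] = {}
        for src, dst in self.edges:
            parents.setdefault(dst, set()).add(src)
        reachable, frontier = set(answer_nodes), list(answer_nodes)
        while frontier:  # walk backwards from the answer nodes
            node = frontier.pop()
            for p in parents.get(node, ()):
                if p not in reachable:
                    reachable.add(p)
                    frontier.append(p)
        return len(reachable) / max(len(self.steps), 1)


def pairwise_prefer(a: ReasoningDAG, b: ReasoningDAG,
                    a_answers: set[int], b_answers: set[int]) -> str:
    """Toy pairwise judgment: prefer the trace whose steps contribute
    more to its final answer."""
    return "a" if a.useful_fraction(a_answers) >= b.useful_fraction(b_answers) else "b"


# Tiny usage example: node 2 is a dead-end tangent, so only 3/4 of steps count.
trace = ReasoningDAG()
trace.add_step(0, "Restate the problem.")
trace.add_step(1, "Try factoring.", depends_on=(0,))
trace.add_step(2, "Unrelated tangent.", depends_on=(0,))
trace.add_step(3, "Conclude x = 4.", depends_on=(1,))
print(trace.useful_fraction({3}))  # 0.75
```

Comparing traces over their graph structure rather than as flat token sequences is what lets an evaluator credit steps that actually feed into the final answer.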

๐Ÿ“ Abstract
Large Reasoning Models (LRMs) increasingly rely on reasoning traces with complex internal structure. However, existing work lacks a unified answer to three fundamental questions: (1) what defines high-quality reasoning, (2) how to reliably evaluate long, implicitly structured reasoning traces, and (3) how to use such evaluation signals for reasoning optimization. To address these challenges, we provide a unified perspective. (1) We introduce the ME² principle, which characterizes reasoning quality at the macro and micro levels in terms of efficiency and effectiveness. (2) Building on this principle, we model reasoning traces as directed acyclic graphs (DAGs) and develop a DAG-based pairwise evaluation method that captures complex reasoning structure. (3) Based on this method, we construct the TRM-Preference dataset and train a Thinking Reward Model (TRM) to evaluate reasoning quality at scale. Experiments show that thinking rewards serve as an effective optimization signal: at test time, selecting better reasoning leads to better outcomes (up to a 19.3% gain), and during RL training, thinking rewards enhance reasoning and performance (up to a 3.9% gain) across diverse tasks.
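
As a rough illustration of the test-time use described above, the sketch below scores N candidate traces with a reward model and keeps the highest-scoring one. The checkpoint path is a placeholder and the single-logit sequence-classification interface is an assumption; the paper's actual TRM API is not shown here.

```python
# Hedged best-of-N sketch: rank sampled reasoning traces with a reward
# model and return the best. All names here are illustrative.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical checkpoint path; substitute the real TRM when available.
# tokenizer = AutoTokenizer.from_pretrained("path/to/trm")
# trm = AutoModelForSequenceClassification.from_pretrained("path/to/trm", num_labels=1)


def select_best_trace(question: str, traces: list[str], trm, tokenizer) -> str:
    """Return the candidate reasoning trace the reward model scores highest."""
    scores = []
    for trace in traces:
        inputs = tokenizer(question, trace, return_tensors="pt",
                           truncation=True, max_length=4096)
        with torch.no_grad():
            # Assumes a single-logit classification head whose output
            # is read as the thinking reward.
            scores.append(trm(**inputs).logits.squeeze().item())
    return traces[max(range(len(traces)), key=scores.__getitem__)]
```

During RL training the same scalar could presumably be combined with the task reward, which is how a thinking reward would act as an optimization signal; the exact weighting is a training detail not given here.
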
Problem

Research questions and friction points this paper is trying to address.

complex reasoning
reasoning evaluation
reasoning optimization
reasoning traces
large reasoning models
Innovation

Methods, ideas, or system contributions that make the work stand out.

ME² principle
reasoning trace evaluation
directed acyclic graph (DAG)
Thinking Reward Model (TRM)
reasoning optimization