CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts

📅 2025-10-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In few-shot domain-expert large language model (LLM) training, standard outcome-oriented reinforcement learning (RL) improves accuracy but degrades logical consistency. To address this, we propose a consistency-aware two-stage training framework: (1) a lightweight process reward model (PRM) is distilled from a small general-purpose LLM to evaluate reasoning steps; (2) dynamic data reconstruction and step-wise RL jointly optimize reasoning paths. This avoids the high computational cost of large-scale PRMs and is the first method to effectively enhance reasoning consistency in expert LLMs using only small models. Experiments show +16.5% improvement in logical consistency and +7.5% in accuracy; human evaluation confirms significant gains in answer coherence and domain expertise. Our core contribution is a low-cost, robust, consistency-driven training paradigm that advances trustworthy reasoning modeling under data-scarce conditions.
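The summary describes a consistency-aware reward that blends the usual outcome signal with step-wise scores from a small PRM. As a rough illustration only (the paper's actual formulation, weights, and function names are not given here; `lam` and the linear blend are assumptions), such a reward might look like:

```python
def consistency_aware_reward(outcome_correct, step_consistency_scores, lam=0.5):
    """Blend an outcome reward with a mean step-level consistency score.

    outcome_correct: whether the final answer matched the reference.
    step_consistency_scores: floats in [0, 1], one per reasoning step,
        as a lightweight process reward model (PRM) might emit.
    lam: hypothetical weight trading off outcome vs. consistency.
    """
    outcome_reward = 1.0 if outcome_correct else 0.0
    if step_consistency_scores:
        consistency = sum(step_consistency_scores) / len(step_consistency_scores)
    else:
        consistency = 0.0
    # Correct answers reached via inconsistent reasoning get a reduced
    # reward, penalizing the accuracy-vs-consistency trade-off the paper
    # observes in outcome-only RL.
    return (1 - lam) * outcome_reward + lam * consistency
```

For example, a correct answer with perfectly consistent steps scores 1.0, while an incorrect answer with half-consistent steps scores 0.25, so the policy is pushed toward coherent reasoning paths rather than lucky final answers.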

📝 Abstract
Training expert LLMs in domains with scarce data is difficult, often relying on multiple-choice questions (MCQs). However, standard outcome-based reinforcement learning (RL) on MCQs is risky: while it may improve accuracy, we observe it often degrades reasoning quality, such as logical consistency. Existing solutions for supervising reasoning, such as large-scale Process Reward Models (PRMs), are prohibitively expensive. To address this, we propose CLARity, a cost-effective RL framework that enhances reasoning quality using only a small, general-purpose LLM. CLARity integrates a consistency-aware reward mechanism with a 2-stage refine-then-monitor training pipeline to enhance reasoning consistency, and a dynamic data reformulation strategy to better exploit limited data. Experiments demonstrate that CLARity improves response consistency by 16.5% and accuracy by 7.5% over baselines. Human evaluations further confirm holistic improvements in coherence and professionalism. Thus, CLARity offers a generalizable solution that enables smaller models to effectively guide expert models through reasoning consistency. Our code is open-sourced at: https://github.com/Infinite-set/CLARity
Problem

Research questions and friction points this paper is trying to address.

Improving reasoning consistency in expert LLMs with scarce data
Reducing reliance on expensive process supervision methods
Enhancing accuracy and logical coherence through consistency rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses small general-purpose LLM for cost-effective training
Implements consistency-aware reward with refine-monitor pipeline
Applies dynamic data reformulation to maximize limited data
Jiuheng Lin
Peking University
Natural Language Processing
Cong Jiang
Peking University
Zirui Wu
Peking University
Jiarui Sun
Peking University
Yansong Feng
Peking University
Natural Language Processing
Pattern Recognition