Audited Reasoning Refinement: Fine-Tuning Language Models via LLM-Guided Step-Wise Evaluation and Correction

📅 2025-09-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the scarcity of high-quality supervision signals for small reasoning models in data-scarce settings, this paper proposes a large-language-model (LLM)-guided two-stage alignment framework. In Stage I, an LLM drives iterative error detection and correction to generate high-fidelity reasoning traces. In Stage II, joint optimization of intermediate reasoning paths and final outputs enables explicit calibration of the reasoning process. The method integrates initial reasoning generation, supervised fine-tuning (SFT), and direct preference optimization (DPO). We construct a high-quality dataset of 600 instances for the EERD graph evaluation task; experiments show substantial improvements in small-model reasoning accuracy. Crucially, the framework balances interpretability and generalizability, demonstrating strong reproducibility and scalability in low-resource domains such as education.
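The Stage II alignment described above ends with direct preference optimization (DPO), which scores a preferred reasoning trace against a dispreferred one relative to a frozen reference model. As a minimal sketch of that objective (not the authors' code; `beta` and the log-probability inputs are illustrative):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the summed token log-probability of a full response
    under the current policy (logp_*) or the frozen reference (ref_logp_*).
    """
    # Implicit reward margins: how much the policy has shifted its
    # preference for each response relative to the reference model.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # Bradley-Terry-style negative log-sigmoid of the scaled difference.
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy and reference agree (equal margins), the loss sits at log 2; it falls as the policy learns to prefer the human-validated trace over the flawed one.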

📝 Abstract
Training a task-specific small reasoning model is challenging when direct human supervision or high-quality labels are scarce. However, LLMs with reasoning capabilities produce abundant intermediate reasoning traces that can be systematically refined to create effective supervision signals. We propose Reason-Refine-then-Align (R2tA), which turns refined model rationales into supervision for training task-specific reasoning models. Our method generates initial reasoning and responses from an open-source base model on task-specific inputs, then refines these traces, fixing hallucinations and inconsistencies, to form a high-fidelity dataset. We perform a two-stage alignment: supervised fine-tuning (SFT) followed by direct preference optimization (DPO), which calibrates the model's intermediate reasoning with human-validated conceptual preferences and then conditions the final output on that aligned reasoning. As a case study, we apply R2tA to evaluate extended entity relationship diagrams (EERDs) in database system design, a structurally complex task where prompt-only methods miss or hallucinate errors. We curated a dataset of 600 EERD variants (train/test split of 450/150, respectively) with induced mistakes spanning 11 categories. Empirical evaluation suggests R2tA provides a practical, cost-effective path to scalable LLM adaptation in data-scarce domains, enabling reproducible AI tools for education and beyond.
Problem

Research questions and friction points this paper is trying to address.

Refining LLM reasoning traces to create supervision signals
Training task-specific reasoning models when direct human supervision is scarce
Correcting hallucinations in complex structural evaluation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-guided step-wise evaluation and correction
Two-stage alignment with SFT and DPO
Refined rationales as supervision for training
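The step-wise evaluation and correction in Stage I can be pictured as a simple loop: an LLM critic flags defects in a reasoning trace, a corrector rewrites it, and the loop stops once no defects remain or a budget is exhausted. A toy sketch with stubbed critic/corrector callables (names and loop structure are hypothetical; the paper's actual prompts and models are not reproduced here):

```python
def refine_trace(trace, critique, correct, max_rounds=3):
    """Iteratively refine a reasoning trace.

    critique(trace) -> list of detected issues (empty when clean);
    correct(trace, issues) -> revised trace.
    Both are stand-ins for LLM calls in the real pipeline.
    """
    for _ in range(max_rounds):
        issues = critique(trace)
        if not issues:  # trace passes inspection; stop early
            break
        trace = correct(trace, issues)
    return trace

# Toy usage: a critic that flags a marker string, a corrector that fixes it.
critic = lambda t: ["flagged step"] if "WRONG" in t else []
fixer = lambda t, issues: t.replace("WRONG", "right")
refined = refine_trace("step1 WRONG step2", critic, fixer)
```

The `max_rounds` cap keeps the cost of repeated LLM calls bounded, which matters in the low-resource settings the paper targets.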
Sumanta Bhattacharyya
Department of Computer Science, University of Illinois Chicago
Sara Riaz
Department of Computer Science, University of Illinois Chicago
Pedram Rooshenas
University of Illinois Chicago
Deep Generative Models · Machine Learning