🤖 AI Summary
Existing LLM-as-a-judge approaches predominantly rely on cross-entropy loss for fine-tuning, neglecting the numerical continuity of scores and the interpretability of scoring rationales. To address these limitations, we propose a two-stage regression-aware chain-of-thought (CoT) reasoning framework. In Stage I, CoT supervision is introduced to enhance the interpretability of score justification. In Stage II, CoT generation and regression-aware score prediction are jointly optimized, achieving the first deep integration of CoT supervision with regression losses, specifically a weighted combination of mean squared error (MSE) and cross-entropy (CE). Our method achieves significant improvements over state-of-the-art baselines across four mainstream LLM-as-a-judge benchmarks and two large language model families. Ablation studies confirm the necessity and effectiveness of both the two-stage collaborative training strategy and the joint modeling of regression and reasoning.
📝 Abstract
The LLM-as-a-judge paradigm uses large language models (LLMs) for automated text evaluation, where a numerical assessment is assigned by an LLM to the input text following scoring rubrics. Existing methods for LLM-as-a-judge use cross-entropy (CE) loss for fine-tuning, which neglects the numeric nature of score prediction. Recent work addresses the numerical prediction limitations of LLM fine-tuning through regression-aware fine-tuning, which, however, does not consider chain-of-thought (CoT) reasoning for score prediction. In this paper, we introduce TRACT (Two-stage Regression-Aware fine-tuning with CoT), a method combining CoT reasoning with regression-aware training. TRACT consists of two stages: first, a seed LLM is fine-tuned to generate CoTs, which serve as supervision for the second-stage fine-tuning. The training objective of TRACT combines the CE loss for learning CoT reasoning capabilities and the regression-aware loss for score prediction. Experiments across four LLM-as-a-judge datasets and two LLMs show that TRACT significantly outperforms existing methods. Extensive ablation studies validate the importance of each component in TRACT.
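To make the combined objective concrete, below is a minimal, dependency-free sketch of one common way a regression-aware loss is paired with CE: the CE term supervises the CoT tokens, while the regression term penalizes the squared error between the target score and the expected score under the model's distribution over score tokens (e.g. "1" through "5"). The function name `tract_style_loss`, the mixing weight `alpha`, and the 1..K score range are illustrative assumptions, not the paper's exact formulation.

```python
import math

def tract_style_loss(score_logits, target_score, cot_ce_nll, alpha=0.5):
    """Hypothetical sketch of a TRACT-style objective.

    score_logits: logits for the K candidate score tokens (scores 1..K assumed)
    target_score: gold numeric score
    cot_ce_nll:   CE (negative log-likelihood) already computed on the CoT tokens
    alpha:        assumed mixing weight between the CE and MSE terms
    """
    # Softmax over the score-token logits (numerically stabilized).
    m = max(score_logits)
    exps = [math.exp(x - m) for x in score_logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Expected score under the model's distribution: sum_k k * p(k).
    expected = sum((k + 1) * p for k, p in enumerate(probs))
    # Regression-aware term: squared error of the expected score.
    mse = (expected - target_score) ** 2
    # Weighted combination of CoT cross-entropy and score regression.
    return alpha * cot_ce_nll + (1 - alpha) * mse
```

With uniform logits over five score tokens, the expected score is 3.0, so a target of 3 zeroes out the MSE term and the loss reduces to the weighted CE term alone; sharpening the distribution away from the target raises the regression term smoothly, which is the continuity that a pure CE objective ignores.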