TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning for LLM-as-a-Judge

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-as-a-judge approaches predominantly rely on cross-entropy loss for fine-tuning, neglecting both the numerical continuity of scores and the interpretability of the scoring rationale. To address these limitations, we propose a two-stage regression-aware chain-of-thought (CoT) reasoning framework. In Stage I, CoT supervision is introduced to make score justifications interpretable. In Stage II, CoT generation and regression-aware score prediction are jointly optimized, integrating CoT supervision with regression losses, specifically weighted combinations of mean squared error (MSE) and cross-entropy (CE). Our method achieves significant improvements over state-of-the-art baselines across four mainstream LLM-as-a-judge benchmarks and two large language model families. Ablation studies confirm the necessity and effectiveness of both the two-stage training strategy and the joint modeling of regression and reasoning.

📝 Abstract
The LLM-as-a-judge paradigm uses large language models (LLMs) for automated text evaluation, where an LLM assigns a numerical assessment to the input text following scoring rubrics. Existing methods for LLM-as-a-judge use cross-entropy (CE) loss for fine-tuning, which neglects the numeric nature of score prediction. Recent work addresses the numerical prediction limitations of LLM fine-tuning through regression-aware fine-tuning, which, however, does not consider chain-of-thought (CoT) reasoning for score prediction. In this paper, we introduce TRACT (Two-stage Regression-Aware fine-tuning with CoT), a method combining CoT reasoning with regression-aware training. TRACT consists of two stages: first, the seed LLM is fine-tuned to generate CoTs, which serve as supervision for the second-stage fine-tuning. The training objective of TRACT combines the CE loss for learning the CoT reasoning capabilities with the regression-aware loss for score prediction. Experiments across four LLM-as-a-judge datasets and two LLMs show that TRACT significantly outperforms existing methods. Extensive ablation studies validate the importance of each component of TRACT.
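To make the combined objective concrete, here is a minimal sketch (not the authors' code) of how a CE term on the gold score token can be mixed with a regression-aware term on the distribution's expected score. The 1-5 score range, the weighting `lam`, and all function names are illustrative assumptions; in practice the CE term would also cover the CoT tokens.

```python
import math

# Hypothetical score vocabulary: the judge emits one token from {1..5}.
SCORES = [1, 2, 3, 4, 5]

def softmax(logits):
    # Numerically stable softmax over the score-token logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def combined_loss(score_logits, target_score, lam=0.5):
    """lam * CE(gold score token) + (1 - lam) * MSE(expected score vs. gold).

    `lam` is an assumed mixing weight, not a value from the paper.
    """
    probs = softmax(score_logits)
    # Cross-entropy on the gold score token.
    ce = -math.log(probs[SCORES.index(target_score)])
    # Regression-aware term: squared error between the expected score
    # under the predicted distribution and the gold score.
    expected = sum(s * p for s, p in zip(SCORES, probs))
    mse = (expected - target_score) ** 2
    return lam * ce + (1 - lam) * mse
```

A distribution peaked on the correct score token drives both terms toward zero, while the MSE term additionally penalizes probability mass placed on numerically distant scores, which plain CE treats the same as mass on adjacent scores.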
Problem

Research questions and friction points this paper is trying to address.

Cross-entropy fine-tuning neglects the numeric nature of score prediction
Regression-aware fine-tuning does not incorporate chain-of-thought reasoning
How to combine regression-aware training with CoT score justification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines CoT reasoning with regression-aware training
Two-stage fine-tuning for enhanced score prediction
Integrates CE loss and regression-aware loss objectives