🤖 AI Summary
Existing LLM-as-a-judge approaches predominantly rely on cross-entropy loss for fine-tuning, neglecting the numerical continuity of scores and the interpretability of scoring rationales. To address these limitations, we propose a two-stage regression-aware chain-of-thought (CoT) reasoning framework. In Stage I, CoT supervision is introduced to enhance the interpretability of score justification. In Stage II, CoT generation and regression-aware score prediction are jointly optimized, achieving the first deep integration of CoT supervision with regression losses, specifically a weighted combination of mean squared error (MSE) and cross-entropy (CE). Our method achieves significant improvements over state-of-the-art baselines across four mainstream LLM-as-a-judge benchmarks and two large language model families. Ablation studies confirm the necessity and effectiveness of both the two-stage collaborative training strategy and the joint modeling of regression and reasoning.
📝 Abstract
The LLM-as-a-judge paradigm uses large language models (LLMs) for automated text evaluation, where a numerical assessment is assigned by an LLM to the input text following scoring rubrics. Existing methods for LLM-as-a-judge use cross-entropy (CE) loss for fine-tuning, which neglects the numeric nature of score prediction. Recent work addresses the numerical prediction limitations of LLM fine-tuning through regression-aware fine-tuning, which, however, does not consider chain-of-thought (CoT) reasoning for score prediction. In this paper, we introduce TRACT (Two-stage Regression-Aware fine-tuning with CoT), a method combining CoT reasoning with regression-aware training. TRACT consists of two stages: first, a seed LLM is fine-tuned to generate CoTs, which serve as supervision for the second-stage fine-tuning. The training objective of TRACT combines the CE loss for learning CoT reasoning capabilities and the regression-aware loss for score prediction. Experiments across four LLM-as-a-judge datasets and two LLMs show that TRACT significantly outperforms existing methods. Extensive ablation studies validate the importance of each component in TRACT.
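To make the combined objective concrete, below is a minimal, dependency-free sketch of one common way a regression-aware loss is paired with CE: the CE term supervises the CoT tokens, while the regression term penalizes the squared error between the target score and the expected score under the model's distribution over score tokens (e.g. "1" through "5"). The function name `tract_style_loss`, the mixing weight `alpha`, and the 1..K score range are illustrative assumptions, not the paper's exact formulation.

```python
import math

def tract_style_loss(score_logits, target_score, cot_ce_nll, alpha=0.5):
    """Hypothetical sketch of a TRACT-style objective.

    score_logits: logits for the K candidate score tokens (scores 1..K assumed)
    target_score: gold numeric score
    cot_ce_nll:   CE (negative log-likelihood) already computed on the CoT tokens
    alpha:        assumed mixing weight between the CE and MSE terms
    """
    # Softmax over the score-token logits (numerically stabilized).
    m = max(score_logits)
    exps = [math.exp(x - m) for x in score_logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Expected score under the model's distribution: sum_k k * p(k).
    expected = sum((k + 1) * p for k, p in enumerate(probs))
    # Regression-aware term: squared error of the expected score.
    mse = (expected - target_score) ** 2
    # Weighted combination of CoT cross-entropy and score regression.
    return alpha * cot_ce_nll + (1 - alpha) * mse
```

With uniform logits over five score tokens, the expected score is 3.0, so a target of 3 zeroes out the MSE term and the loss reduces to the weighted CE term alone; sharpening the distribution away from the target raises the regression term smoothly, which is the continuity that a pure CE objective ignores.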