On the Shelf Life of Fine-Tuned LLM Judges: Future Proofing, Backward Compatibility, and Question Generalization

📅 2025-09-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work systematically investigates the long-term practicality of LLM-based evaluators, addressing three core challenges: future proofing (assessing responses from future models), backward compatibility (evaluating responses from past models), and question generalization (discriminating responses to unseen questions). It formally defines the first two properties for the first time and proposes a continual learning–based training framework. Using mathematics as a unified evaluation domain, the authors benchmark both supervised fine-tuning (SFT) and direct preference optimization (DPO) across three base LLMs. Results show that DPO substantially improves backward compatibility but offers limited gains in future proofing, and that all evaluators exhibit performance degradation on unseen questions. The study provides a formal characterization of evaluator robustness, a principled methodological framework, and an empirical benchmark for the reliable deployment of evaluation models.

๐Ÿ“ Abstract
The LLM-as-a-judge paradigm is widely used both for evaluating free-text model responses and for reward modeling in model alignment and finetuning. Recently, finetuning judges with judge-specific data has emerged as an often preferred choice over directly prompting frontier models as judges, as the former achieves better performance with smaller model sizes while being more robust to common biases. However, the standard evaluation ignores several practical concerns of finetuned judges regarding their real-world deployment. In this paper, we identify and formalize three aspects that affect the shelf life of these judges: future proofing and backward compatibility -- how well judges finetuned on responses by today's generator models perform on responses by future models or past models -- as well as question generalization -- how well judges generalize to unseen questions at test time. We study these three aspects in the math domain under a unified framework with varying train and test distributions, three SFT- and DPO-based finetuning algorithms and three different base models. Experiments suggest that future proofing is challenging for most models, while backward compatibility is relatively easy, with DPO-trained models consistently improving performance. We further find that continual learning provides a more balanced adaptation to shifts between older and newer response distributions than training solely on stronger or weaker responses. Moreover, all models observe certain degrees of performance degradation when moving from questions seen during training to unseen ones, showing that current judges do not fully generalize to unseen questions. These findings provide insights into practical considerations for developing and deploying judge models in the face of ever-changing generators.
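The three shelf-life axes above reduce to one measurement: bucket a judge's accuracy by which generator era produced the response (past, training-time, or future) and by whether the question was seen during judge training. A minimal sketch in Python, where the toy `judge` callable and the record fields are illustrative assumptions rather than the paper's actual API:

```python
from collections import defaultdict

def shelf_life_report(judge, records):
    """Bucket judge accuracy by generator era and question novelty.

    judge:   callable(question, response) -> predicted label
    records: iterable of dicts with keys
             question, response, label,
             era  ("past" | "train" | "future"),
             seen (True if the question appeared in judge training data)
    Returns a dict mapping each bucket name to accuracy in [0, 1].
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        pred = judge(r["question"], r["response"])
        # Each record counts toward its era bucket and its seen/unseen bucket.
        for key in (r["era"], "seen" if r["seen"] else "unseen"):
            totals[key] += 1
            hits[key] += int(pred == r["label"])
    return {k: hits[k] / totals[k] for k in totals}
```

Comparing the `"future"` bucket against `"past"` probes future proofing versus backward compatibility, while the `"seen"`/`"unseen"` gap measures question generalization.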
Problem

Research questions and friction points this paper is trying to address.

Evaluating fine-tuned LLM judges on responses from future and past generator models
Assessing judge generalization to unseen questions at test time
Studying the shelf life of fine-tuned judges under varying training distributions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Judges fine-tuned on judge-specific data instead of prompted frontier models
DPO training consistently improves backward compatibility
Continual learning balances adaptation across older and newer response distributions
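The continual-learning recipe in the last bullet can be sketched as sequential fine-tuning over generator "eras", oldest first, keeping a snapshot after each stage so that backward compatibility can be probed at every point. This is a hedged sketch of the general pattern, not the paper's training code; `train_step` and the checkpoint object are hypothetical placeholders:

```python
def continual_finetune(base_ckpt, eras, train_step):
    """Fine-tune a judge sequentially on each era's data.

    base_ckpt:  initial judge checkpoint (any object train_step accepts)
    eras:       ordered list of (era_name, dataset) pairs, oldest first
    train_step: callable(ckpt, dataset) -> updated ckpt
                (e.g. one SFT or DPO pass over that era's responses)
    Returns the final checkpoint and a per-era snapshot dict.
    """
    snapshots = {}
    ckpt = base_ckpt
    for name, data in eras:
        ckpt = train_step(ckpt, data)   # adapt to this era's distribution
        snapshots[name] = ckpt          # keep a copy for later evaluation
    return ckpt, snapshots
```

The design choice this captures: rather than training solely on the strongest (or weakest) generator's responses, the judge absorbs each distribution shift in order, which the abstract reports yields a more balanced trade-off between old and new response distributions.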