Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low consistency between large language models (LLMs) and human judgments in subjective evaluation tasks, this paper proposes a human-AI collaborative framework. Given only sparse raw label data, it employs rejection sampling to infer the *tree-of-thought* (ToT) reasoning traces underlying human judgments, thereby constructing a high-quality chain-of-thought (CoT)-enhanced dataset without manual CoT annotation. Subsequently, open-source LLMs undergo CoT fine-tuning, while structured annotation guidelines are generated for closed-source LLMs. The method significantly improves LLM–human agreement (average +12.3% Kendall’s τ) and cross-model scoring consensus, achieving state-of-the-art performance on multiple subjective evaluation benchmarks (e.g., SummEval, PEX). Its core innovation lies in the scalable, faithful reconstruction of trustworthy ToT traces directly from label-only data, enabling low-cost, high-fidelity alignment with human judgments.

📝 Abstract
Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited for subjective tasks, where human judgments involve subtle reasoning beyond annotation labels. Thinking traces, the reasoning behind a judgment, are highly informative but challenging to collect and curate. We present a human-LLM collaborative framework to infer thinking traces from label-only annotations. The proposed framework uses a simple and effective rejection sampling method to reconstruct these traces at scale. These inferred thinking traces are applied to two complementary tasks: (1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation guidelines for proprietary LLM raters. Across multiple datasets, our methods lead to significantly improved LLM-human agreement. Additionally, the refined annotation guidelines increase agreement among different LLM models. These results suggest that LLMs can serve as practical proxies for otherwise unrevealed human thinking traces, enabling label-only corpora to be extended into thinking-trace-augmented resources that enhance the reliability of LLM raters.
Problem

Research questions and friction points this paper is trying to address.

Inferring thinking traces from label-only annotations
Improving LLM rater reliability for subjective tasks
Enhancing LLM-human agreement through reconstructed reasoning traces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Infer thinking traces from label-only annotations
Use rejection sampling to reconstruct reasoning traces
Apply inferred traces to fine-tune and guide LLMs
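The rejection-sampling idea above can be sketched in a few lines: repeatedly prompt an LLM to produce a candidate reasoning trace plus a predicted label, and accept the first trace whose predicted label matches the human's label-only annotation. This is a minimal illustration, not the paper's implementation; `generate_trace` is a hypothetical stand-in for an LLM call.

```python
from typing import Callable, Optional, Tuple


def infer_thinking_trace(
    item: str,
    human_label: str,
    generate_trace: Callable[[str], Tuple[str, str]],
    max_samples: int = 8,
) -> Optional[str]:
    """Rejection sampling over candidate traces.

    Samples (trace, predicted_label) pairs from the LLM and accepts
    the first trace whose predicted label agrees with the human's
    label-only annotation; returns None if the budget is exhausted.
    """
    for _ in range(max_samples):
        trace, predicted = generate_trace(item)
        if predicted == human_label:
            return trace  # accepted: trace is consistent with the label
    return None  # rejected: no label-consistent trace found
```

Accepted traces can then be collected into a CoT-augmented dataset for fine-tuning open raters, while rejected items are simply dropped or retried with a larger sampling budget.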
Xingjian Zhang
University of Michigan, Ann Arbor, Michigan, USA
Tianhong Gao
University of Michigan, Ann Arbor, Michigan, USA
Suliang Jin
University of Michigan, Ann Arbor, Michigan, USA
Tianhao Wang
University of California, San Diego, San Diego, California, USA
Teng Ye
University of Minnesota, Twin Cities, Minneapolis, Minnesota, USA
Eytan Adar
Professor, University of Michigan
Human-Computer Interaction, HCI, Human-AI Interaction, Visualization, Social Media
Qiaozhu Mei
Professor, University of Michigan
AI, data mining, information retrieval, natural language processing, health informatics