Noise-Response Calibration: A Causal Intervention Protocol for LLM-Judges

πŸ“… 2026-03-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the unreliability of large language models (LLMs) as automatic evaluators, particularly in label-scarce settings where stochasticity and overconfidence hinder consistent deployment. To tackle this, the authors propose a reproducible calibration method grounded in causal intervention: controlled noise is introduced via signal-to-noise-ratio perturbations for tabular data and lexical perturbations for textual inputs, and hypothesis testing is performed on the performance-degradation slopes observed across repeated trials. Experiments on UCI tabular benchmarks and four text classification datasets reveal a clear modality gap: tabular evaluators exhibit low sensitivity to noise and consistently weaker performance, whereas textual evaluators display predictable degradation patterns. Leveraging these insights, the study establishes a standardized reporting protocol for calibrating LLM-based evaluators under distributional shift.
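The SNR-based perturbation for tabular inputs can be sketched as below. This is a minimal illustration, assuming additive Gaussian noise scaled to a target SNR in decibels; the function name and noise model are assumptions, not the authors' exact implementation:

```python
import numpy as np

def add_noise_at_snr(x: np.ndarray, snr_db: float,
                     rng: np.random.Generator) -> np.ndarray:
    """Perturb an array with Gaussian noise at a target SNR (in dB).

    Noise power is chosen so that 10 * log10(P_signal / P_noise) == snr_db,
    so lower snr_db means heavier corruption of the features.
    """
    p_signal = np.mean(x ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(p_noise), size=x.shape)
    return x + noise

# Example: corrupt a synthetic tabular feature matrix at 5 dB SNR.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 8))
corrupted = add_noise_at_snr(features, snr_db=5.0, rng=rng)
```

Sweeping `snr_db` downward produces the graded noise-severity levels over which the degradation trend is then measured.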

πŸ“ Abstract
Large language models (LLMs) are increasingly used as automated judges and synthetic labelers, especially in low-label settings. Yet these systems are stochastic and often overconfident, which makes deployment decisions difficult when external ground truth is limited. We propose a practical calibration protocol based on controlled input interventions: as noise severity increases, task performance should exhibit a statistically significant deterioration trend. We operationalize this with a slope-based hypothesis test over repeated trials, using signal-to-noise-ratio (SNR) perturbations for tabular data and lexical perturbations for text data. Across UCI tabular benchmarks and four text classification datasets, we find clear modality-dependent behavior. Our results reveal a modality gap: while text-based judges degrade predictably, most tabular datasets show no statistically significant performance deterioration even under substantial signal-to-noise reduction. Interestingly, we find that model performance is lower on datasets that are insensitive to noise interventions. We present a reproducible methodology and reporting protocol for robust LLM-judge calibration under distribution shift.
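The slope-based hypothesis test can be realized roughly as follows: evaluate the judge at increasing noise severities over repeated trials, fit an ordinary least-squares line to the scores, and test whether the fitted slope is significantly negative. The simulated accuracies and the one-sided test via `scipy.stats.linregress` are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np
from scipy.stats import linregress

def degradation_slope_test(severities, scores, alpha=0.05):
    """One-sided test: does performance deteriorate as noise severity grows?

    Fits score ~ severity by OLS; the two-sided p-value from linregress is
    halved, and we reject only when the fitted slope is negative.
    """
    res = linregress(severities, scores)
    p_one_sided = res.pvalue / 2 if res.slope < 0 else 1 - res.pvalue / 2
    return res.slope, p_one_sided, p_one_sided < alpha

# Simulated example: 5 severity levels x 5 repeated trials each,
# with accuracy dropping ~0.05 per severity step plus trial noise.
rng = np.random.default_rng(42)
severities = np.repeat(np.arange(5), 5)
scores = 0.90 - 0.05 * severities + rng.normal(0, 0.01, size=severities.size)

slope, p, significant = degradation_slope_test(severities, scores)
```

A "well-behaved" judge in this framework yields a clearly negative slope with a small p-value; the tabular evaluators in the paper are those for which this rejection fails even at low SNR.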
Problem

Research questions and friction points this paper is trying to address.

LLM-judges
calibration
noise-response
distribution shift
overconfidence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Noise-Response Calibration
Causal Intervention
LLM-Judges
Signal-to-Noise Ratio
Distribution Shift
πŸ”Ž Similar Papers
No similar papers found.