The Necessity of Setting Temperature in LLM-as-a-Judge

📅 2026-03-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current practice in LLM-as-a-Judge commonly relies on fixed temperature settings, such as 0.1 or 1.0, yet these choices lack systematic empirical validation. This work introduces, for the first time, a causal inference framework that combines controlled experiments with statistical analysis to rigorously investigate the causal effect of temperature on the evaluation performance of large language models. The study finds that temperature exerts a significant and task-dependent influence on judgment quality, challenging the prevailing assumption that low temperatures are universally optimal. By moving beyond the correlational observations that have dominated prior work, the paper provides robust empirical evidence and practical guidance for temperature selection in LLM-based evaluation systems.
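The controlled-experiment setup the summary describes is straightforward to picture concretely: hold the judge model, prompt, and dataset fixed and vary only the temperature. The sketch below illustrates such a sweep; `call_judge` is a hypothetical stand-in for any chat-completion API call, and the dataset and labels are synthetic placeholders, not the paper's code or data.

```python
# Minimal sketch of a controlled temperature sweep for an LLM judge.
# Only `temperature` varies across runs, so differences in agreement with
# human gold labels can be attributed to the temperature setting itself.
import random

TEMPERATURES = [0.0, 0.1, 0.5, 1.0]  # includes the common 0.1 / 1.0 defaults

def call_judge(item: dict, temperature: float) -> str:
    """Hypothetical judge call returning 'A' or 'B' for a pairwise comparison.
    Replace the body with a real chat-completion API call; the seeded random
    choice here is only a deterministic stand-in for model output."""
    random.seed(hash((item["id"], temperature)))
    return random.choice(["A", "B"])

def agreement(dataset: list[dict], temperature: float) -> float:
    """Fraction of items where the judge's verdict matches the human label."""
    hits = sum(call_judge(x, temperature) == x["human_label"] for x in dataset)
    return hits / len(dataset)

# Synthetic placeholder dataset; in practice these are human-annotated items.
dataset = [{"id": i, "human_label": random.choice(["A", "B"])} for i in range(200)]
for t in TEMPERATURES:
    print(f"temperature={t:.1f}  agreement={agreement(dataset, t):.3f}")
```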
📝 Abstract
LLM-as-a-Judge has emerged as an effective and low-cost paradigm for evaluating text quality and factual correctness. Prior studies have shown substantial agreement between LLM judges and human experts, even on tasks that are difficult to assess automatically. In practice, researchers commonly employ fixed temperature configurations during evaluation, with values of 0.1 and 1.0 being the most prevalent choices, a convention that is largely empirical rather than principled. However, recent research suggests that LLM performance exhibits non-trivial sensitivity to temperature settings, that lower temperatures do not universally yield optimal outcomes, and that such effects are highly task-dependent. This raises a critical research question: does temperature influence judge performance in LLM-centric evaluation? To address it, we systematically investigate the relationship between temperature and judge performance through a series of controlled experiments, and we adopt a causal inference framework within our statistical analysis to rigorously examine the direct causal effect of temperature on judge behavior, offering actionable engineering insights for the design of LLM-centric evaluation pipelines.
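The abstract does not name the paper's exact statistical machinery, so the sketch below shows one plausible instance of the analysis it describes: a paired sign-flip permutation test on per-item correctness collected at two temperature settings. Because the paired runs differ only in temperature, a significant accuracy gap can be read as a direct effect of that setting. Variable names and data here are illustrative assumptions, not the authors' method.

```python
# Paired sign-flip permutation test: does judge accuracy differ between two
# temperature settings evaluated on the same items?
import numpy as np

rng = np.random.default_rng(0)
n_items = 200
acc_low = rng.integers(0, 2, n_items)   # placeholder: per-item correctness at T=0.1
acc_high = rng.integers(0, 2, n_items)  # placeholder: per-item correctness at T=1.0

diff = acc_low - acc_high               # paired difference for each item
observed = diff.mean()                  # observed mean accuracy gap

# Under the null hypothesis that temperature has no effect, each item's
# difference is equally likely to carry either sign; flip signs at random
# many times to build the null distribution of the mean gap.
n_perm = 10_000
flips = rng.choice([-1, 1], size=(n_perm, n_items))
null_means = (flips * diff).mean(axis=1)
p_value = (np.abs(null_means) >= abs(observed)).mean()

print(f"mean accuracy gap = {observed:.3f}, permutation p = {p_value:.4f}")
```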
Problem

Research questions and friction points this paper is trying to address.

LLM-as-a-Judge
temperature
evaluation
judge performance
text quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

temperature sensitivity
LLM-as-a-Judge
causal inference
evaluation pipeline
controlled experiments