Exploring the Impact of Temperature on Large Language Models: Hot or Cold?

📅 2025-06-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Prior work lacks a systematic understanding of how sampling temperature (0 to 4.0) affects the capabilities of open-source LLMs across scales. Method: We conduct statistical analysis and SuperGLUE benchmarking on small-, medium-, and large-scale LLMs across six capability dimensions, identifying a scale-dependent "Mutation Temperature" threshold at which performance changes significantly. We propose a task-adaptive, BERT-based dynamic temperature selector, departing from fixed-temperature paradigms, and evaluate its robustness under FP16 and 4-bit quantized inference. Contribution/Results: Our method significantly improves average SuperGLUE performance for small and medium models (+2.1 to 3.7 points), characterizes optimal temperature intervals and critical thresholds per model scale, and establishes an interpretable, deployable framework for temperature-aware controllable generation in LLMs.

📝 Abstract
The sampling temperature, a critical hyperparameter in large language models (LLMs), modifies the logits before the softmax layer, thereby reshaping the distribution of output tokens. Recent studies have challenged the Stochastic Parrots analogy by demonstrating that LLMs are capable of understanding semantics rather than merely memorizing data, and that randomness, modulated by sampling temperature, plays a crucial role in model inference. In this study, we systematically evaluated the impact of temperature in the range of 0 to 2 on datasets designed to assess six different capabilities, conducting statistical analyses on open-source models of three different sizes: small (1B--4B), medium (6B--13B), and large (40B--80B). Our findings reveal distinct skill-specific effects of temperature on model performance, highlighting the complexity of optimal temperature selection in practical applications. To address this challenge, we propose a BERT-based temperature selector that takes advantage of these observed effects to identify the optimal temperature for a given prompt. We demonstrate that this approach can significantly improve the performance of small and medium models on the SuperGLUE benchmark. Furthermore, our study extends to FP16-precision inference, revealing that temperature effects are consistent with those observed in 4-bit quantized models. By evaluating temperature effects up to 4.0 in three quantized models, we find that the Mutation Temperature -- the point at which significant performance changes occur -- increases with model size.
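The mechanism described at the start of the abstract can be sketched in a few lines. This is a minimal, generic illustration of temperature-scaled sampling (not the paper's code): logits are divided by the temperature before the softmax, so low temperatures sharpen the distribution toward the top token and high temperatures flatten it.

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng=None):
    """Sample a token index from temperature-scaled logits.

    Dividing logits by T before the softmax sharpens the distribution
    as T -> 0 (approaching greedy decoding) and flattens it as T grows.
    """
    rng = rng or np.random.default_rng(0)
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0:                  # greedy limit: pick the top logit
        return int(np.argmax(logits))
    scaled = logits / temperature
    scaled -= scaled.max()                # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))
```

With temperature 0 the call reduces to an argmax; at temperature 2 the same logits yield a visibly flatter distribution, which is the regime the paper probes up to T = 4.0 in quantized models.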
Problem

Research questions and friction points this paper is trying to address.

How does sampling temperature affect LLM performance across six capability dimensions and three model scales?
Can a BERT-based selector choose the optimal temperature per prompt in practical applications?
Are temperature effects consistent between FP16-precision and 4-bit quantized inference?
Innovation

Methods, ideas, or system contributions that make the work stand out.

BERT-based temperature selector picks a per-prompt temperature, improving SuperGLUE performance
Systematic evaluation of temperature effects across small, medium, and large model scales
FP16-precision inference shows temperature effects consistent with 4-bit quantized models
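The selector idea above can be framed as classification over a discrete grid of candidate temperatures. The sketch below is a hypothetical simplification, not the paper's implementation: the prompt is assumed to already be encoded into a fixed-size embedding (in the paper this role is played by BERT), and `W` and `b` stand in for a trained classification head whose output bins correspond to candidate temperatures.

```python
import numpy as np

# Assumed discretization of the temperature range studied in the paper.
CANDIDATE_TEMPS = [0.0, 0.5, 1.0, 1.5, 2.0]

def select_temperature(prompt_embedding, W, b):
    """Map a prompt embedding to one of the candidate temperatures.

    W (n_bins x d) and b (n_bins,) are placeholders for a trained
    classification head over a BERT-style [CLS] embedding; the predicted
    bin with the highest score determines the sampling temperature.
    """
    scores = W @ np.asarray(prompt_embedding, dtype=np.float64) + b
    return CANDIDATE_TEMPS[int(np.argmax(scores))]
```

The design choice here is to treat temperature selection as classification rather than regression, which matches evaluating a small fixed grid of temperatures per task.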