🤖 AI Summary
This work addresses suboptimal decoding decisions in large language models that arise when the model's predictive distribution is mismatched with the true generating distribution. Because calibrating over free-form text is ill-posed, the authors introduce task calibration as a paradigm: the predictive distribution is calibrated in a task-induced semantic latent space (for example, class labels, integers, or sets) rather than over raw strings, and Minimum Bayes Risk (MBR) decoding is then applied to the calibrated latent distribution. They also propose Task Calibration Error (TCE), an application-aware metric that quantifies the excess loss due to miscalibration. Experiments show consistent improvements in generation quality across different tasks and baselines, supporting more reliable model decisions.
📝 Abstract
LLM decoding often relies on the model's predictive distribution to generate an output. Consequently, misalignment with respect to the true generating distribution leads to suboptimal decisions in practice. While a natural solution is to calibrate the model's output distribution, for LLMs, this is ill-posed at the combinatorially vast level of free-form language. We address this by building on the insight that in many tasks, these free-form outputs can be interpreted in a semantically meaningful latent structure, for example, discrete class labels, integers, or sets. We introduce task calibration as a paradigm to calibrate the model's predictive distribution in the task-induced latent space. We apply a decision-theoretic result to show that Minimum Bayes Risk (MBR) decoding on the task-calibrated latent distribution is the optimal decoding strategy on latent model beliefs. Empirically, it consistently improves generation quality across different tasks and baselines. We also introduce Task Calibration Error (TCE), an application-aware calibration metric that quantifies the excess loss due to miscalibration. Our work demonstrates that task calibration enables more reliable model decisions across various tasks and applications.
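To make the decoding step concrete, the following is a minimal sketch of MBR decoding over a latent distribution, not the authors' implementation. It assumes the latent space is the integers, that sampled outputs are mapped to it with a hypothetical `parse_int` helper, and that calibration is a simple temperature map. The abstract does not specify the calibration method, so `temperature_calibrate` is a placeholder only. All function names here are illustrative.

```python
import re
from collections import Counter


def parse_int(text):
    """Hypothetical parser: map a free-form output to an integer latent value."""
    return int(re.search(r"-?\d+", text).group())


def latent_distribution(samples, parse):
    """Empirical distribution over latent values from sampled model outputs."""
    counts = Counter(parse(s) for s in samples)
    total = sum(counts.values())
    return {z: c / total for z, c in counts.items()}


def temperature_calibrate(dist, temperature):
    """Placeholder calibration map (an assumption): p -> p^(1/T), renormalized."""
    powered = {z: p ** (1.0 / temperature) for z, p in dist.items()}
    norm = sum(powered.values())
    return {z: p / norm for z, p in powered.items()}


def mbr_decode(dist, candidates, loss):
    """Return the candidate with the lowest expected loss under `dist`."""
    def risk(action):
        return sum(p * loss(action, z) for z, p in dist.items())
    return min(candidates, key=risk)


# Seven sampled free-form outputs whose latent values are 1, 1, 1, 5, 6, 7, 8.
samples = [f"The answer is {v}." for v in (1, 1, 1, 5, 6, 7, 8)]
dist = latent_distribution(samples, parse_int)
candidates = range(0, 10)

# Under 0-1 loss, MBR picks the mode of the latent distribution.
mode_choice = mbr_decode(dist, candidates, lambda a, z: float(a != z))

# Under absolute error, MBR picks a median, which differs from the mode here.
median_choice = mbr_decode(dist, candidates, lambda a, z: abs(a - z))

# Recalibrating the latent distribution can change the MBR decision.
flattened = temperature_calibrate(dist, temperature=2.0)
calibrated_choice = mbr_decode(flattened, candidates, lambda a, z: abs(a - z))

print(mode_choice, median_choice, calibrated_choice)
```

The same samples yield different decisions depending on the task loss and on how the latent distribution is calibrated. This is why miscalibration in the latent space can translate into excess loss of the kind TCE is described as measuring.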