🤖 AI Summary
This work addresses the frequent miscalibration of language models when users request randomized outputs: the model’s output distribution often deviates from the intended target distribution. The study shows that probabilistic calibration can be learned through fine-tuning on synthetic prompts that require sampling from mathematical distributions, and compares two complementary strategies: soft-target fine-tuning, which converts the target distribution into trie-derived next-token targets, and hard-target fine-tuning, which trains on completions sampled directly from the target distribution. Experiments across twelve models spanning four families show that both approaches substantially improve sampling fidelity on held-out distribution families and unseen parameter settings. Soft targets perform better on broader stochastic generation benchmarks (open-ended random generation, multiple-choice answer-position balancing, and NoveltyBench), while hard targets are often strongest on structured numeric sampling. The gains sometimes reduce downstream capability, especially arithmetic reasoning, with the cost varying by model.
📝 Abstract
Language models are increasingly used in settings where outputs must satisfy user-specified randomness constraints, yet their generation probabilities are often poorly calibrated to those targets. We study whether this capability can be improved directly through fine-tuning. Concretely, we fine-tune language models on synthetic prompts that require sampling from mathematical distributions, and compare two Calibration Fine-Tuning variants: a soft-target method that converts the desired output distribution into trie-derived next-token targets, and a hard-target method that trains on sampled completions from the same target distribution. Across 12 models spanning four families, both methods substantially improve structured-sampling fidelity on held-out distribution families and unseen parameter settings, showing that probabilistic calibration is a trainable capability. Under our selected training configurations, the two methods exhibit different empirical profiles: hard-target fine-tuning is often strongest on structured numeric sampling, while soft-target fine-tuning performs better on broader stochastic generation benchmarks, including open-ended random generation, multiple-choice answer-position balancing, and NoveltyBench. The gains sometimes reduce downstream capability, especially arithmetic reasoning, with costs varying by model. Overall, our results show that probabilistic calibration can be improved through fine-tuning, with our hard-target configuration favoring exact numeric fidelity and our soft-target configuration favoring broader stochastic transfer. Code is available at https://github.com/chandar-lab/calibration-finetuning.
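To make the contrast between the two variants concrete, the following is a minimal sketch of how training targets could be built from one target distribution. It is not the authors' implementation: the character-level tokenizer, the `<eos>` marker, and the function names `soft_targets` and `hard_targets` are assumptions made for illustration. The soft-target function walks the trie of tokenized outcomes and assigns each prefix the conditional next-token distribution implied by the target; the hard-target function simply samples whole completions.

```python
import random
from collections import defaultdict

EOS = "<eos>"  # hypothetical end-of-sequence marker


def tokenize(text):
    """Toy character-level tokenizer (assumption; real models use subword tokens)."""
    return list(text) + [EOS]


def soft_targets(dist):
    """Map each token prefix (a trie node) to its conditional next-token distribution."""
    # Probability mass of all outcomes whose token sequence passes through each prefix.
    mass = defaultdict(float)
    for outcome, p in dist.items():
        tokens = tokenize(outcome)
        for i in range(len(tokens) + 1):
            mass[tuple(tokens[:i])] += p

    # P(token | prefix) = mass(prefix + token) / mass(prefix).
    targets = defaultdict(dict)
    for prefix, m in mass.items():
        if prefix:  # the empty prefix is the trie root and has no parent
            parent = prefix[:-1]
            targets[parent][prefix[-1]] = m / mass[parent]
    return dict(targets)


def hard_targets(dist, n, seed=0):
    """Draw n whole completions from the target distribution."""
    rng = random.Random(seed)
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=n)


if __name__ == "__main__":
    dist = {"1": 0.5, "12": 0.25, "2": 0.25}
    for prefix, nxt in soft_targets(dist).items():
        print(prefix, nxt)
    print(hard_targets(dist, 5))
```

In this sketch, a soft-target loss would compare the model's next-token distribution at each prefix against the computed conditional distribution, whereas a hard-target loss would apply ordinary next-token cross-entropy to each sampled completion. Outcomes that share a prefix (here `"1"` and `"12"`) are the reason a trie is needed: the mass at the shared node has to be split between ending the sequence and continuing it.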