Investigating the Multilingual Calibration Effects of Language Model Instruction-Tuning

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the under-calibration commonly observed in multilingual large language models after instruction fine-tuning, particularly manifesting as overconfidence without corresponding gains in accuracy for low-resource languages. The study systematically analyzes how supervised fine-tuning affects multilingual calibration performance and uncovers a cross-lingual calibration imbalance induced by the dominance of high-resource language data during transfer. To address this issue, the authors propose a general-purpose calibration strategy that requires no annotated data for low-resource languages, integrating label smoothing to improve calibration consistency. Evaluated on benchmarks covering 29 and 42 languages, the method demonstrably mitigates overconfidence in low-resource settings and enhances cross-lingual calibration alignment, offering a practical solution to a critical challenge in multilingual model deployment.
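The calibration gap described above is conventionally quantified with Expected Calibration Error (ECE): predictions are binned by confidence, and the weighted average gap between within-bin accuracy and within-bin mean confidence is reported. A minimal NumPy sketch, assuming equal-width bins (the function name and binning scheme are illustrative, not the paper's exact implementation):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average |accuracy - mean confidence| over equal-width
    confidence bins. `confidences` are the model's max predicted
    probabilities; `correct` flags whether each prediction was right."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight bin by its share of samples
    return ece
```

Under this metric, the overconfidence pattern the paper reports corresponds to bins whose mean confidence sits well above their accuracy, inflating ECE in low-resource languages after instruction-tuning.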

📝 Abstract
Ensuring that deep learning models are well-calibrated in their predictive uncertainty is essential to maintaining their trustworthiness and reliability, yet despite rapid advances in foundation model research, the relationship between large language models (LLMs) and their calibration remains an open area of research. In this work, we examine a critical gap in the calibration of LLMs in multilingual settings, aiming to better understand how data scarcity can lead to different calibration effects and how commonly used techniques apply in these settings. Our analysis on two multilingual benchmarks, covering 29 and 42 languages respectively, reveals that even in low-resource languages, model confidence can increase significantly after instruction-tuning on high-resource-language SFT datasets. However, improvements in accuracy are marginal or non-existent, resulting in mis-calibration and highlighting a critical shortcoming of standard SFT in multilingual settings. Furthermore, we find that label smoothing is a reasonable method to alleviate this concern, again without any need for low-resource SFT data, maintaining better calibration across all languages. Overall, this highlights the importance of multilingual considerations in both training and tuning LLMs in order to improve their reliability and fairness in downstream use.
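Label smoothing, the mitigation the abstract points to, replaces the one-hot training target with a mixture of the gold label and a uniform distribution, which penalizes extreme logits and thus tempers overconfidence. A minimal NumPy sketch of the smoothed cross-entropy (the function name and epsilon default are illustrative; the paper's training setup is not specified here):

```python
import numpy as np

def label_smoothing_loss(logits, targets, epsilon=0.1):
    """Cross-entropy against a smoothed target distribution that puts
    (1 - epsilon) on the gold class and epsilon/K uniformly on all K
    classes, discouraging the overconfident peaked outputs plain SFT
    can produce."""
    logits = np.asarray(logits, dtype=float)
    # Numerically stable log-softmax
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    n_classes = logits.shape[-1]
    # Smoothed target: (1 - eps) one-hot + eps/K uniform
    smooth = np.full_like(log_probs, epsilon / n_classes)
    smooth[np.arange(len(targets)), targets] += 1.0 - epsilon
    return -(smooth * log_probs).sum(axis=-1).mean()
```

With `epsilon=0` this reduces to standard cross-entropy, so it can be dropped into an existing SFT loop as a one-line change; crucially, as the abstract notes, it needs no low-resource SFT data.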
Problem

Research questions and friction points this paper is trying to address.

multilingual calibration
instruction-tuning
large language models
model confidence
low-resource languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual calibration
instruction-tuning
label smoothing
large language models
model reliability