Mitigating Label Length Bias in Large Language Models

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a systematic label length bias in large language models (LLMs), in which performance disparities arise across multi-token labels even after standard length normalization, a persistent issue in multi-candidate prediction. The authors propose Normalized Contextual Calibration (NCC), the first method to explicitly identify and mitigate bias induced by multi-token labels. NCC couples label-level probability normalization with a context-aware calibration mechanism, improving prediction robustness and stability in zero-shot and few-shot settings. Extensive experiments show that NCC consistently outperforms existing approaches across multiple benchmark datasets and mainstream LLMs, with gains of up to 10 percentage points in F1. NCC also generalizes to broader tasks such as multiple-choice question answering. By mitigating context-dependent label length bias, NCC supports fairer and more accurate prediction in in-context learning.

📝 Abstract
Large language models (LLMs) are powerful zero- and few-shot learners. However, when predicting over a set of candidate options, LLMs suffer from label biases, and existing calibration methods overlook biases arising from multi-token class labels. We tackle an issue we call label length bias, where labels of different lengths are treated inconsistently, even after standard length normalization. To mitigate it, we propose normalized contextual calibration (NCC), an effective method that normalizes and calibrates predictions at the full-label level. NCC achieves statistically significant improvements over prior approaches across multiple datasets and models, with gains of up to 10% F1. Moreover, NCC extends bias mitigation to broader tasks such as multiple-choice question answering. Our analysis shows that, when combined with in-context learning, NCC is less sensitive to few-shot example selection, requires fewer examples for competitive performance, and produces more reliable confidence estimates. These findings highlight the importance of mitigating full-label biases to improve the performance and robustness of LLM-based methods, particularly in real-world applications where class labels naturally consist of multiple tokens.
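The abstract describes NCC only at a high level: predictions are normalized and calibrated at the full-label level. A minimal sketch of what that might look like is below, assuming (a) each candidate label's score is its length-normalized sum of token log-probabilities, and (b) calibration subtracts the label's score under a content-free prompt, in the spirit of standard contextual calibration. The exact NCC procedure may differ; all function names and the toy numbers are illustrative.

```python
import math

def label_logprob(token_logprobs):
    """Length-normalized log-probability of a multi-token label:
    the mean per-token log-prob, so a label is not penalized
    merely for spanning more tokens."""
    return sum(token_logprobs) / len(token_logprobs)

def calibrated_scores(labels, in_context, content_free):
    """Sketch of full-label calibration:
    1) length-normalize each label's log-prob in context,
    2) subtract the same label's log-prob under a content-free
       prompt (the model's prior bias toward that label),
    3) softmax the calibrated scores into a distribution."""
    cal = [label_logprob(in_context[lab]) - label_logprob(content_free[lab])
           for lab in labels]
    z = max(cal)  # subtract max for numerical stability
    exps = [math.exp(c - z) for c in cal]
    total = sum(exps)
    return {lab: e / total for lab, e in zip(labels, exps)}

# Toy example: "very negative" is two tokens and is favored by the
# model's prior (content-free prompt), so calibration shifts
# probability back toward "positive".
labels = ["positive", "very negative"]
in_context = {"positive": [-1.2], "very negative": [-0.9, -1.1]}
content_free = {"positive": [-1.5], "very negative": [-0.6, -0.8]}
probs = calibrated_scores(labels, in_context, content_free)
```

In a real pipeline, the per-token log-probabilities would come from scoring each full label string with the LLM under the task prompt and under a content-free prompt (e.g. "N/A"); the rest of the computation is model-agnostic.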
Problem

Research questions and friction points this paper is trying to address.

Mitigating label length bias in multi-token class label predictions
Addressing inconsistent treatment of different label lengths in LLMs
Improving calibration for full-label biases beyond standard normalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Normalized contextual calibration mitigates label length bias
Method calibrates predictions at full-label level
Improves LLM performance across datasets and models