🤖 AI Summary
Language models (LMs) encode substantial factual knowledge, yet conventional evaluations, which demand overly strict surface-form matches, underestimate their parametric knowledge because model outputs often differ in format from the canonical answers.
Method: We identify this evaluation bias and propose Retrieval-Constrained Decoding (RCD), a decoding strategy that retrieves valid answer surface forms and constrains generation to them, evaluated on the new YAGO-QA benchmark.
Contribution/Results: Experiments demonstrate that output format significantly affects how accurately factual knowledge is identified; RCD mitigates the misclassification of answers from smaller models that arises from variation in expression. Systematic evaluation across open-source LMs of multiple scales shows RCD boosts F1 from 32.3% to 46.0% for Llama-3.1-70B, and lifts the 8B variant to 33.0%, surpassing the larger model under unconstrained decoding. This work provides the first systematic evidence of format-induced bias in factual knowledge evaluation and introduces a scalable constrained-decoding framework for fairer, more accurate assessment of LM factual competence.
📄 Abstract
Language models (LMs) encode substantial factual knowledge, but often produce answers judged as incorrect. We hypothesize that many of these answers are actually correct, but are expressed in alternative surface forms that are dismissed due to an overly strict evaluation, leading to an underestimation of models' parametric knowledge. We propose Retrieval-Constrained Decoding (RCD), a decoding strategy that restricts model outputs to unique surface forms. We introduce YAGO-QA, a dataset of 19,137 general knowledge questions. Evaluating open-source LMs from 135M to 70B parameters, we show that standard decoding undervalues their knowledge. For instance, Llama-3.1-70B scores only 32.3% F1 with vanilla decoding but 46.0% with RCD. Similarly, Llama-3.1-8B reaches 33.0% with RCD, outperforming the larger model under vanilla decoding. We publicly share the code and dataset at https://github.com/Rajjaa/disambiguated-LLM.
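The abstract describes RCD as restricting model outputs to a retrieved set of surface forms. A common way to implement such a constraint is a prefix trie over the allowed strings, so that at each decoding step only tokens continuing some allowed answer are considered. The sketch below is illustrative only: the function names and the toy scorer (standing in for an LM's next-token log-probabilities) are assumptions, not the paper's implementation.

```python
def build_trie(surface_forms):
    """Build a prefix trie over whitespace-tokenized answer strings."""
    trie = {}
    for form in surface_forms:
        node = trie
        for tok in form.split():
            node = node.setdefault(tok, {})
        node["<eos>"] = {}  # mark a complete surface form
    return trie

def constrained_greedy_decode(score_fn, trie):
    """Greedily pick the best-scoring token among trie-allowed continuations."""
    node, output = trie, []
    while True:
        allowed = list(node.keys())  # only continuations of allowed answers
        tok = max(allowed, key=lambda t: score_fn(output, t))
        if tok == "<eos>":
            return " ".join(output)
        output.append(tok)
        node = node[tok]

# Toy scorer standing in for the model's preferences over next tokens.
def toy_score(prefix, token):
    prefs = {"NYC": 1.0, "New": 0.9, "York": 0.8, "City": 0.7, "<eos>": 0.5}
    return prefs.get(token, 0.0)

forms = ["New York City", "New York", "NYC"]
print(constrained_greedy_decode(toy_score, build_trie(forms)))  # -> NYC
```

Under this constraint the model can only emit one of the retrieved surface forms, so an answer expressed in an alternative but valid form is no longer penalized by strict string matching.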