Retrieval-Constrained Decoding Reveals Underestimated Parametric Knowledge in Language Models

📅 2025-09-27
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Language models (LMs) encode substantial factual knowledge, yet conventional evaluations, with their overly strict surface-form matching, underestimate parametric knowledge because model outputs and canonical answers differ in format. Method: The paper identifies this evaluation bias and proposes Retrieval-Constrained Decoding (RCD), which restricts generation to retrieved candidate surface forms during decoding, and introduces the YAGO-QA benchmark for evaluation. Contribution/Results: Experiments show that output format strongly affects measured factual accuracy, and that RCD mitigates the misclassification of correct answers expressed in alternative surface forms, which especially penalizes smaller models. Systematic evaluation of open-source LMs across scales shows RCD raises Llama-3.1-70B's F1 from 32.3% to 46.0%, while the 8B variant reaches 33.0%, surpassing the larger model under unconstrained decoding. This work provides the first systematic evidence of format-induced bias in factual knowledge evaluation and a scalable constrained-decoding framework for fairer, more accurate assessment of LM factual competence.

📝 Abstract
Language models (LMs) encode substantial factual knowledge, but often produce answers judged as incorrect. We hypothesize that many of these answers are actually correct, but are expressed in alternative surface forms that are dismissed due to an overly strict evaluation, leading to an underestimation of models' parametric knowledge. We propose Retrieval-Constrained Decoding (RCD), a decoding strategy that restricts model outputs to unique surface forms. We introduce YAGO-QA, a dataset of 19,137 general knowledge questions. Evaluating open-source LMs from 135M to 70B parameters, we show that standard decoding undervalues their knowledge. For instance, Llama-3.1-70B scores only 32.3% F1 with vanilla decoding but 46.0% with RCD. Similarly, Llama-3.1-8B reaches 33.0% with RCD, outperforming the larger model under vanilla decoding. We publicly share the code and dataset at https://github.com/Rajjaa/disambiguated-LLM.
Problem

Research questions and friction points this paper is trying to address.

Strict surface-form matching underestimates language models' parametric factual knowledge
Correct answers expressed in alternative surface forms are dismissed as wrong
Surface-form variability confounds fair evaluation of model answers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Constrained Decoding restricts model outputs to valid surface forms
Introduces YAGO-QA, a dataset of 19,137 general knowledge questions
Shows constrained decoding yields fairer estimates of parametric knowledge
🔎 Similar Papers
No similar papers found.