From Parameters to Prompts: Understanding and Mitigating the Factuality Gap between Fine-Tuned LLMs

📅 2025-05-29

📈 Citations: 0

✨ Influential: 0

career value

180K/year

🤖 AI Summary

This paper addresses the factual inconsistency gap in fine-tuned large language models (LLMs) between known (in-distribution) and unknown (out-of-distribution) knowledge. Methodologically, it establishes— for the first time at the theoretical level—that test-time prompting techniques (e.g., in-context learning [ICL] and chain-of-thought [CoT]) can attenuate or even override the influence of fine-tuning data; it further introduces the “prompt dominance” theory, redefining evaluation criteria for fine-tuning data. Empirical results demonstrate that ICL and CoT improve factual accuracy by over 35% on out-of-distribution knowledge tasks. The core contribution lies in uncovering the compensatory mechanism whereby prompt engineering mitigates fine-tuning biases, thereby establishing a novel paradigm—“prompting to compensate for fine-tuning deficiencies.” This work provides interpretable, quantifiable theoretical principles and practical guidelines for synergistically optimizing fine-tuning and reasoning.

Technology Category

Application Category

📝 Abstract

Factual knowledge extraction aims to explicitly extract knowledge parameterized in pre-trained language models for application in downstream tasks. While prior work has been investigating the impact of supervised fine-tuning data on the factuality of large language models (LLMs), its mechanism remains poorly understood. We revisit this impact through systematic experiments, with a particular focus on the factuality gap that arises when fine-tuning on known versus unknown knowledge. Our findings show that this gap can be mitigated at the inference stage, either under out-of-distribution (OOD) settings or by using appropriate in-context learning (ICL) prompts (i.e., few-shot learning and Chain of Thought (CoT)). We prove this phenomenon theoretically from the perspective of knowledge graphs, showing that the test-time prompt may diminish or even overshadow the impact of fine-tuning data and play a dominant role in knowledge extraction. Ultimately, our results shed light on the interaction between finetuning data and test-time prompt, demonstrating that ICL can effectively compensate for shortcomings in fine-tuning data, and highlighting the need to reconsider the use of ICL prompting as a means to evaluate the effectiveness of fine-tuning data selection methods.

Problem

Research questions and friction points this paper is trying to address.

Understanding the factuality gap in fine-tuned LLMs

Mitigating factuality gap via inference-stage strategies

Exploring interaction between fine-tuning data and prompts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mitigate factuality gap via inference-stage adjustments

Use in-context learning prompts for knowledge extraction

Test-time prompts dominate fine-tuning data impact

🔎 Similar Papers

Factual consistency evaluation of summarization in the Era of large language models