🤖 AI Summary
This study addresses the challenge of incorporating time-sensitive, post-pretraining knowledge into large language models for multi-hop question answering. We systematically compare parametric (fine-tuning) and non-parametric (retrieval-augmented generation, RAG) approaches to knowledge injection across three open-source 7B-scale models. Evaluations are conducted on the standard QASC benchmark as well as a newly constructed multi-hop QA dataset featuring time-sensitive questions about 2024 events. Our experiments show that RAG significantly outperforms unsupervised fine-tuning in handling temporal knowledge, while supervised fine-tuning achieves the highest overall accuracy, underscoring the value of task-specific training. This work provides empirical evidence and practical guidance for selecting effective strategies to update dynamic knowledge in language models.
📝 Abstract
Multi-hop question answering is widely used to evaluate the reasoning capabilities of large language models (LLMs), as it requires integrating multiple pieces of supporting knowledge to arrive at a correct answer. While prior work has explored different mechanisms for providing knowledge to LLMs, such as fine-tuning and retrieval-augmented generation (RAG), their relative effectiveness for multi-hop question answering remains insufficiently understood, particularly when the required knowledge is temporally novel. In this paper, we systematically compare parametric and non-parametric knowledge injection methods for open-domain multi-hop question answering. We evaluate unsupervised fine-tuning (continual pretraining), supervised fine-tuning, and retrieval-augmented generation across three 7B-parameter open-source LLMs. Experiments are conducted on two benchmarks: QASC, a standard multi-hop science question answering dataset, and a newly constructed dataset of over 10,000 multi-hop questions derived from Wikipedia events in 2024, designed to test knowledge beyond the models' pretraining cutoff. Our results show that unsupervised fine-tuning provides only limited gains over base models, suggesting that continual pretraining alone is insufficient for improving multi-hop reasoning accuracy. In contrast, retrieval-augmented generation yields substantial and consistent improvements, particularly when answering questions that rely on temporally novel information. Supervised fine-tuning achieves the highest overall accuracy across models and datasets. These findings highlight fundamental differences in how knowledge injection mechanisms support multi-hop question answering and underscore the importance of retrieval-based methods when external or compositional knowledge is required.
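To make the parametric/non-parametric distinction concrete, the sketch below illustrates the non-parametric (RAG) side: retrieved passages are prepended to the prompt so a frozen model can answer from context rather than from its weights. This is a minimal toy assuming a trivial lexical-overlap retriever and an invented corpus and question; it is not the paper's actual pipeline, which the abstract does not specify.

```python
# Minimal sketch of non-parametric knowledge injection (RAG-style prompting).
# The corpus, question, and scoring function are illustrative assumptions.

def tokenize(text):
    return text.lower().split()

def overlap_score(question, passage):
    # Toy lexical retriever: count tokens shared between question and passage.
    return len(set(tokenize(question)) & set(tokenize(passage)))

def retrieve(question, corpus, k=2):
    # Rank passages by overlap with the question and keep the top-k.
    return sorted(corpus, key=lambda p: overlap_score(question, p), reverse=True)[:k]

def build_prompt(question, corpus):
    # Prepend retrieved evidence so the (frozen) model can answer from
    # context instead of relying solely on parametric knowledge.
    context = "\n".join(f"- {p}" for p in retrieve(question, corpus))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Hypothetical mini-corpus with one temporally novel (2024) fact.
corpus = [
    "Event A occurred in March 2024 in City X.",
    "The capital of Country Y is City X.",
    "Photosynthesis converts light energy into chemical energy.",
]

print(build_prompt("Which 2024 event occurred in the capital of Country Y?", corpus))
```

The multi-hop aspect shows up in the retrieved context: answering requires composing the bridge fact (City X is the capital) with the temporally novel fact (the 2024 event), which is exactly the setting where the abstract reports RAG outperforming unsupervised fine-tuning.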