Understanding Retrieval Augmentation for Long-Form Question Answering

📅 2023-10-18
🏛️ arXiv.org
📈 Citations: 41 (2 influential)
🤖 AI Summary
This study investigates attribution reliability in retrieval-augmented generation (RAG) for long-form question answering (LFQA): how accurately and verifiably large language models (LLMs) ground their answers in retrieved evidence. To this end, we construct a human-annotated attribution dataset tailored to LFQA, enabling systematic analysis of how answers vary across multiple LLMs conditioned on identical retrieval outputs. We quantify how retrieval quality affects answer fluency, length, and attribution accuracy. We further propose a semantic-alignment-based automatic attribution evaluation framework that identifies three canonical attribution errors (omission, misattribution, and fabrication) and their underlying causes. Experimental results show that retrieval quality strongly influences attribution correctness and that attribution capability varies significantly across LLMs. Our work provides an empirical foundation and a methodological toolkit for assessing the interpretability and trustworthiness of RAG systems.
📝 Abstract
We present a study of retrieval-augmented language models (LMs) on long-form question answering. We analyze how retrieval augmentation impacts different LMs by comparing answers generated by different models from the same evidence documents, and how the quality of the retrieved document set impacts answers generated by the same LM. We study various attributes of generated answers (e.g., fluency, length, variance), with an emphasis on the attribution of generated long-form answers to in-context evidence documents. We collect human annotations of answer attribution and evaluate methods for automatically judging attribution. Our study provides new insights into how retrieval augmentation impacts long, knowledge-rich text generation by LMs. We further identify attribution patterns for long text generation and analyze the main culprits of attribution errors. Together, our analysis reveals how retrieval augmentation impacts long, knowledge-rich text generation and provides directions for future work.
Problem

Research questions and friction points this paper is trying to address.

Investigating how retrieved documents enhance long-form question answering
Analyzing attribution of generated answers to in-context evidence documents
Evaluating automatic methods for measuring answer attribution accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Controlled studies on retrieval-augmented LMs for LFQA
Human-annotated dataset for sentence-level answer attribution
Analysis of document utilization in knowledge-rich text generation
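The sentence-level attribution task above (deciding, for each answer sentence, which in-context evidence document supports it, if any) can be sketched minimally as follows. This is a hypothetical illustration, not the paper's method: a real evaluator would use an NLI or embedding model for semantic alignment, while here plain token-overlap cosine similarity with an assumed threshold stands in.

```python
# Minimal sketch of sentence-level answer attribution: match each answer
# sentence to its best-supporting evidence document, or flag it as
# unsupported. Token-overlap cosine similarity is a stand-in for a proper
# semantic-alignment model; the 0.3 threshold is an arbitrary assumption.
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; a stand-in for proper tokenization."""
    return re.findall(r"[a-z0-9']+", text.lower())

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two token-count vectors."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def attribute_sentences(answer_sentences, evidence_docs, threshold=0.3):
    """Return, per answer sentence, (best-matching doc index or None, score).

    None marks a sentence no document supports above the threshold,
    i.e., a candidate attribution error such as a fabrication.
    """
    results = []
    for sent in answer_sentences:
        scores = [cosine(tokenize(sent), tokenize(doc)) for doc in evidence_docs]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            results.append((best, scores[best]))
        else:
            results.append((None, scores[best]))
    return results
```

For example, an answer sentence that paraphrases one evidence document maps to that document's index, while an off-topic sentence maps to `None`; swapping in sentence embeddings or an entailment classifier changes only the `cosine` scoring step.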
Hung-Ting Chen
New York University
Natural Language Processing
Fangyuan Xu
Department of Computer Science, The University of Texas at Austin
Shane A. Arora
Department of Computer Science, The University of Texas at Austin
Eunsol Choi
New York University
natural language processing · machine learning