🤖 AI Summary
This work addresses the challenge in long-form retrieval-augmented generation (RAG) where conventional ranking methods often fail to retrieve all relevant factual nuggets, leading to incomplete or missing information in generated outputs. To overcome this limitation, the authors propose CoveR, a coverage-aware dense retrieval approach that introduces, for the first time, synthetic coverage signals derived from sub-question answerability. CoveR employs coverage-oriented contrastive learning and knowledge distillation objectives to train a dual-encoder retriever. Without compromising retrieval relevance, CoveR improves factual coverage by 10% across multiple benchmarks, substantially outperforming strong baselines and achieving a better trade-off between coverage breadth and relevance.
📝 Abstract
Long-form Retrieval-Augmented Generation (RAG) poses the challenge of coverage-based ranking: ranking methods must surface a comprehensive set of relevant nuggets (i.e., facts) so that they can be synthesized into a complete output. In this work, we propose CoveR, a dense retrieval method optimized for coverage-aware retrieval scenarios (our code is available at https://github.com/DylanJoo/CoveR). CoveR is a bi-encoder trained with coverage-based contrastive and distillation objectives, which enable it to capture diverse aspects of an information need. To train CoveR, we create the SCOPE dataset (available at https://huggingface.co/datasets/DylanJHJ/scope), which comprises 90K training pairs from Researchy Questions with synthetic coverage signals derived from LLM-generated sub-question answerability judgments. Our experiments show that CoveR improves nugget coverage by 10% over strong dense retrieval baselines without sacrificing relevance-based retrieval capability. Ablation studies further validate the importance of the proposed learning method, showing that CoveR achieves a superior trade-off between relevance- and coverage-based ranking, which is essential for long-form RAG.
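To make the notion of coverage concrete, the sketch below computes a simple coverage-at-k score from sub-question answerability judgments. It is an illustration only, not the paper's evaluation code: the function name `coverage_at_k`, the `answerable` mapping, and the toy documents are all hypothetical, and the abstract does not specify the exact metric definition.

```python
from typing import Dict, List, Set


def coverage_at_k(
    ranked_doc_ids: List[str],
    answerable: Dict[str, Set[int]],
    num_subquestions: int,
    k: int,
) -> float:
    """Fraction of sub-questions answered by at least one of the top-k documents.

    `answerable` maps a document id to the indices of the sub-questions it can
    answer (an assumed format for the answerability judgments).
    """
    if num_subquestions <= 0:
        raise ValueError("num_subquestions must be positive")
    covered: Set[int] = set()
    for doc_id in ranked_doc_ids[:k]:
        # Documents without judgments contribute no covered sub-questions.
        covered |= answerable.get(doc_id, set())
    return len(covered) / num_subquestions


# Toy example (hypothetical data): 4 sub-questions, three candidate documents.
answerable = {
    "d1": {0, 1},
    "d2": {0, 1},  # relevant, but redundant with d1
    "d3": {2, 3},  # covers the remaining sub-questions
}
relevance_ranking = ["d1", "d2", "d3"]
coverage_ranking = ["d1", "d3", "d2"]

print(coverage_at_k(relevance_ranking, answerable, 4, k=2))  # 0.5
print(coverage_at_k(coverage_ranking, answerable, 4, k=2))   # 1.0
```

In this toy case both rankings place relevant documents in the top two, yet only the second covers every sub-question, which is the gap between relevance-based and coverage-based ranking that the abstract describes.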