🤖 AI Summary
This work addresses the limited response diversity and inadequate information coverage of retrieval-augmented generation (RAG). We propose a two-stage Plan-and-Refine framework: (1) a planning stage that generates a global query plan via multi-perspective prompting to explicitly model diversity; and (2) a refinement stage that iteratively performs conditional generation, self-refinement, and joint evaluation of both factual consistency and coverage, followed by ICAT-driven reward modeling to select the best response. This work introduces the first RAG paradigm that integrates *planning-first*, *iterative refinement*, and *joint evaluation*. Experiments on the ANTIQUE and TREC benchmarks show improvements of up to 13.1% and 15.41%, respectively, over strong baselines. A user study further confirms significant gains in response quality, usability, and perceived informativeness.
📝 Abstract
This paper studies the limitations of (retrieval-augmented) large language models (LLMs) in generating diverse and comprehensive responses, and introduces the Plan-and-Refine (P&R) framework based on a two-phase system design. In the global exploration phase, P&R generates a diverse set of plans for the given input, where each plan consists of a list of diverse query aspects with corresponding descriptions. This phase is followed by a local exploitation phase that generates a response proposal conditioned on each plan and iteratively refines it to improve its quality. Finally, a reward model is employed to select the proposal with the highest factuality and coverage. We conduct our experiments using the ICAT evaluation methodology, a recent approach for evaluating answer factuality and comprehensiveness. Experiments on two diverse information-seeking benchmarks, adapted from non-factoid question answering and the TREC search result diversification task, demonstrate that P&R significantly outperforms baselines, achieving up to a 13.1% improvement on the ANTIQUE dataset and a 15.41% improvement on the TREC dataset. Furthermore, a small-scale user study confirms the efficacy of the P&R framework.
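The two-phase procedure described in the abstract can be sketched as a simple control loop. This is a minimal, hedged illustration, not the authors' implementation: the callables `generate_plans`, `draft_response`, `refine`, and `icat_reward` are hypothetical stand-ins for the paper's LLM prompting steps and the ICAT-based reward model.

```python
# Sketch of the Plan-and-Refine (P&R) loop: global exploration over plans,
# local exploitation via iterative refinement, then reward-based selection.
# All callables below are placeholders, not the paper's actual components.
from typing import Callable, List


def plan_and_refine(
    query: str,
    generate_plans: Callable[[str], List[str]],      # global exploration
    draft_response: Callable[[str, str], str],       # plan-conditioned generation
    refine: Callable[[str, str, str], str],          # self-refinement step
    icat_reward: Callable[[str, str], float],        # factuality/coverage reward
    n_refine: int = 2,
) -> str:
    """Return the proposal with the highest reward across all plans."""
    proposals = []
    for plan in generate_plans(query):
        proposal = draft_response(query, plan)
        for _ in range(n_refine):
            proposal = refine(query, plan, proposal)
        proposals.append(proposal)
    # Select the proposal scoring highest on factuality and coverage.
    return max(proposals, key=lambda p: icat_reward(query, p))
```

Plugging in toy callables (e.g. a reward that favors one plan's output) exercises the selection logic; in the paper, each step would instead be an LLM call or the trained reward model.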