OptiSet: Unified Optimizing Set Selection and Ranking for Retrieval-Augmented Generation

πŸ“… 2026-01-08
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses a key limitation in existing retrieval-augmented generation (RAG) approaches, which typically select top-k passages statically, thereby overlooking inter-passage complementarity and introducing redundancy. To overcome this, the authors propose OptiSet, a novel framework that, for the first time, unifies evidence set selection and set-level ranking within a single modeling paradigm. OptiSet employs an "Expand-then-Refine" strategy to construct diverse candidate sets and integrates multi-view query expansion, candidate re-selection, and a self-synthesized preference learning mechanism that generates high-quality preference labels without relying on strong supervision from large language models. This approach effectively distinguishes complementary from redundant evidence, significantly enhancing both generation quality and efficiency on complex compositional tasks, and demonstrates the efficacy of compact, high-gain evidence sets.

πŸ“ Abstract
Retrieval-Augmented Generation (RAG) improves generation quality by incorporating evidence retrieved from large external corpora. However, most existing methods rely on statically selecting top-k passages based on individual relevance, which fails to exploit combinatorial gains among passages and often introduces substantial redundancy. To address this limitation, we propose OptiSet, a set-centric framework that unifies set selection and set-level ranking for RAG. OptiSet adopts an "Expand-then-Refine" paradigm: it first expands a query into multiple perspectives to enable a diverse candidate pool and then refines the candidate pool via re-selection to form a compact evidence set. We then devise a self-synthesis strategy, without strong LLM supervision, to derive preference labels from the set-conditional utility changes of the generator, thereby identifying complementary and redundant evidence. Finally, we introduce a set-list wise training strategy that jointly optimizes set selection and set-level ranking, enabling the model to favor compact, high-gain evidence sets. Extensive experiments demonstrate that OptiSet improves performance on complex combinatorial problems and makes generation more efficient. The source code is publicly available.
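The core idea in the abstract — scoring marginal, set-conditional utility rather than per-passage relevance — can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: `utility` here is a toy token-coverage proxy for the generator's set-conditional utility, and `refine` is a simple greedy stand-in for the paper's "Refine" stage, keeping only passages whose marginal gain is positive (complementary) and dropping zero-gain ones (redundant).

```python
def utility(evidence_set, answer_tokens):
    """Toy set-conditional utility: fraction of answer tokens covered by the set.
    A stand-in for the generator-derived utility used in the paper."""
    covered = set()
    for passage in evidence_set:
        covered |= set(passage.split()) & answer_tokens
    return len(covered) / max(len(answer_tokens), 1)


def refine(candidates, answer_tokens, min_gain=0.0):
    """Greedy refinement: add a passage only if it strictly increases set
    utility, mirroring the complementary-vs-redundant distinction."""
    selected = []
    # Consider the individually strongest passages first (stable sort).
    for passage in sorted(candidates, key=lambda p: -utility([p], answer_tokens)):
        gain = (utility(selected + [passage], answer_tokens)
                - utility(selected, answer_tokens))
        if gain > min_gain:
            selected.append(passage)  # complementary: positive marginal gain
        # otherwise: redundant or irrelevant under the current set, skip
    return selected


answer = {"paris", "capital", "france"}
candidates = [
    "paris is the capital of france",   # high-gain passage
    "the capital of france is paris",   # redundant duplicate
    "france is in western europe",      # adds nothing once the first is kept
]
compact = refine(candidates, answer)    # keeps only the first passage
```

Note how a statically selected top-2 list would keep both near-duplicate passages; the set-level marginal-gain criterion collapses them to a single compact evidence set.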
Problem

Research questions and friction points this paper is trying to address.

Retrieval-Augmented Generation
set selection
redundancy
combinatorial gains
evidence ranking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation
Set Selection
Set-Level Ranking
Self-Synthesis
Expand-then-Refine
πŸ”Ž Similar Papers
No similar papers found.