TopClustRAG at SIGIR 2025 LiveRAG Challenge

📅 2025-06-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address low answer diversity, weak relevance, and insufficient evidence faithfulness in question answering over large-scale web corpora, this paper proposes a clustering-enhanced Retrieval-Augmented Generation (RAG) framework. Methodologically, it combines sparse (BM25-style) and dense retrieval in a hybrid index, applies K-Means semantic clustering to denoise retrieved evidence, designs cluster-level prompt generation and multi-stage answer synthesis for knowledge-complementary reasoning, and refines context quality via rule-based filtering and cross-encoder re-ranking. The key contributions are semantic clustering–driven context filtering and cluster-specific prompt aggregation. Evaluated on the FineWeb Sample-10BT corpus, the framework ranked 2nd in faithfulness and 7th in correctness on the official leaderboard, demonstrating its effectiveness and robustness at web scale.
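The summary's hybrid retrieval step merges sparse and dense result lists, but the fusion rule is not stated; a common choice is reciprocal rank fusion (RRF), sketched below under that assumption (the function name and `k` constant are illustrative, not from the paper).

```python
# Hypothetical sketch: fuse a sparse (BM25) ranking and a dense ranking
# with reciprocal rank fusion. RRF is an assumed fusion rule; the paper
# does not specify how the two indices are combined.

def rrf_fuse(sparse_ranking, dense_ranking, k=60):
    """Merge two ranked lists of passage IDs by summed reciprocal ranks."""
    scores = {}
    for ranking in (sparse_ranking, dense_ranking):
        for rank, pid in enumerate(ranking, start=1):
            scores[pid] = scores.get(pid, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Example: the two retrievers partially disagree
sparse = ["p1", "p2", "p3"]
dense = ["p3", "p1", "p4"]
print(rrf_fuse(sparse, dense))  # → ['p1', 'p3', 'p2', 'p4']
```

Passages retrieved by both indices (here `p1` and `p3`) rise to the top, which is the usual motivation for RRF over raw score mixing.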

📝 Abstract
We present TopClustRAG, a retrieval-augmented generation (RAG) system developed for the LiveRAG Challenge, which evaluates end-to-end question answering over large-scale web corpora. Our system employs a hybrid retrieval strategy combining sparse and dense indices, followed by K-Means clustering to group semantically similar passages. Representative passages from each cluster are used to construct cluster-specific prompts for a large language model (LLM), generating intermediate answers that are filtered, reranked, and finally synthesized into a single, comprehensive response. This multi-stage pipeline enhances answer diversity, relevance, and faithfulness to retrieved evidence. Evaluated on the FineWeb Sample-10BT dataset, TopClustRAG ranked 2nd in faithfulness and 7th in correctness on the official leaderboard, demonstrating the effectiveness of clustering-based context filtering and prompt aggregation in large-scale RAG systems.
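The clustering step described in the abstract can be sketched as follows: embed the retrieved passages, run K-Means, and take the member nearest each centroid as that cluster's representative. The toy embeddings, the value of `k`, and the Euclidean nearest-to-centroid rule are assumptions for illustration; the real system would use a dense encoder over retrieved web passages.

```python
# Minimal sketch of K-Means passage grouping with per-cluster
# representative selection. Embeddings are toy 2-D vectors here.
import numpy as np
from sklearn.cluster import KMeans

def cluster_representatives(embeddings, passages, k):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    reps = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        # Representative = cluster member closest to its centroid
        dists = np.linalg.norm(embeddings[idx] - km.cluster_centers_[c], axis=1)
        reps.append(passages[idx[np.argmin(dists)]])
    return reps

emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
texts = ["passage-a", "passage-b", "passage-c", "passage-d"]
print(cluster_representatives(emb, texts, k=2))
```

Each representative then seeds one cluster-specific prompt, so the number of intermediate LLM calls equals `k` rather than the full retrieval depth.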
Problem

Research questions and friction points this paper is trying to address.

Enhancing answer diversity in RAG systems
Improving relevance of generated responses
Ensuring faithfulness to retrieved evidence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid sparse and dense retrieval indices
K-Means clustering for semantic passage grouping
Cluster-specific prompts for LLM answer synthesis
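The last innovation above, cluster-specific prompting followed by synthesis, can be sketched as two template functions: one prompt per cluster representative, then one synthesis prompt over the intermediate answers. The prompt wording is purely illustrative; the paper's actual templates are not given here.

```python
# Hedged sketch of the two-stage prompting scheme. Template text is an
# assumption, not the system's real prompts.

def cluster_prompt(question, representative_passage):
    """Prompt the LLM with a single cluster's representative passage."""
    return (
        "Answer the question using only the passage below.\n"
        f"Passage: {representative_passage}\n"
        f"Question: {question}\nAnswer:"
    )

def synthesis_prompt(question, intermediate_answers):
    """Ask the LLM to merge the per-cluster answers into one response."""
    numbered = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(intermediate_answers))
    return (
        "Combine the candidate answers into one comprehensive response.\n"
        f"Question: {question}\nCandidates:\n{numbered}\nFinal answer:"
    )

p = cluster_prompt("Who wrote Hamlet?", "Hamlet is a tragedy by William Shakespeare.")
s = synthesis_prompt("Who wrote Hamlet?", ["Shakespeare", "William Shakespeare"])
print(p)
print(s)
```

In the full pipeline the intermediate answers would first pass through the rule-based filtering and cross-encoder reranking stages before reaching the synthesis prompt.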