🤖 AI Summary
This paper proposes a clustering-enhanced Retrieval-Augmented Generation (RAG) framework to address low answer diversity, weak relevance, and insufficient faithfulness to evidence in question answering over large-scale web corpora. The framework combines BM25 and Dense Passage Retrieval (DPR) in a hybrid retrieval stage, then applies K-Means clustering to group the retrieved passages by semantic similarity and reduce noise in the evidence. It generates one prompt per cluster, so that each intermediate answer draws on a different part of the evidence, and refines the intermediate results with rule-based filtering and cross-encoder re-ranking before synthesizing a final answer. The key contributions are clustering-driven context filtering and cluster-specific prompt aggregation. Evaluated on the FineWeb Sample-10BT dataset, the framework ranked 2nd in faithfulness and 7th in correctness on the official leaderboard, which indicates that the approach is effective on a very large corpus.
📝 Abstract
We present TopClustRAG, a retrieval-augmented generation (RAG) system developed for the LiveRAG Challenge, which evaluates end-to-end question answering over large-scale web corpora. Our system employs a hybrid retrieval strategy combining sparse and dense indices, followed by K-Means clustering to group semantically similar passages. Representative passages from each cluster are used to construct cluster-specific prompts for a large language model (LLM), generating intermediate answers that are filtered, reranked, and finally synthesized into a single, comprehensive response. This multi-stage pipeline enhances answer diversity, relevance, and faithfulness to retrieved evidence. Evaluated on the FineWeb Sample-10BT dataset, TopClustRAG ranked 2nd in faithfulness and 7th in correctness on the official leaderboard, demonstrating the effectiveness of clustering-based context filtering and prompt aggregation in large-scale RAG systems.
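The clustering and cluster-specific prompting steps described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`kmeans`, `pick_representatives`, `build_cluster_prompts`), the prompt wording, and the rule for choosing a representative (the passage nearest its cluster centroid) are all assumptions made for illustration, and the toy 2-D vectors stand in for real passage embeddings produced by a dense retriever.

```python
import math
import random


def kmeans(vectors, k, iters=50, seed=0):
    """Plain K-Means on lists of floats; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def pick_representatives(passages, vectors, labels, centroids, per_cluster=1):
    """Assumed selection rule: keep the passages closest to each centroid."""
    reps = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, l in enumerate(labels) if l == c]
        members.sort(key=lambda i: math.dist(vectors[i], centroid))
        if members:
            reps[c] = [passages[i] for i in members[:per_cluster]]
    return reps


def build_cluster_prompts(question, reps):
    """One prompt per cluster; the wording here is hypothetical."""
    prompts = []
    for c in sorted(reps):
        context = "\n".join(f"- {p}" for p in reps[c])
        prompts.append(
            f"Answer using only the passages below.\n{context}\nQuestion: {question}"
        )
    return prompts


# Toy example: two well-separated groups of "embeddings".
toy_passages = ["a", "b", "c", "d", "e", "f"]
toy_vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_labels, toy_centroids = kmeans(toy_vectors, k=2)
toy_reps = pick_representatives(toy_passages, toy_vectors, toy_labels, toy_centroids)
toy_prompts = build_cluster_prompts("What is discussed?", toy_reps)
```

In a full pipeline, each prompt would be sent to the LLM, and the intermediate answers would then be filtered, reranked, and merged into the final response. Those stages are omitted here because the abstract does not specify them in enough detail to sketch.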