🤖 AI Summary
This work addresses the joint optimization of retrieval and prompting strategies in RAG systems. Using a shared corpus (Fineweb-10BT) and an open-source model (Falcon3-10B-Instruct), the organizers ran a timed question-answering competition involving 70 international teams. A two-stage evaluation paradigm, combining LLM-as-a-judge scoring with human verification, enables fine-grained, reproducible assessment of answer correctness and faithfulness. Systematic evaluation on 500 unseen questions benchmarked the generalization of diverse RAG architectures and surfaced key design principles for efficient retrieval and robust prompting. The study, published at SIGIR 2025, establishes a standardized empirical benchmark for RAG research, accompanied by an open dataset and a fully reproducible evaluation protocol.
📝 Abstract
The LiveRAG Challenge at SIGIR 2025, held between March and May 2025, provided a competitive platform for advancing Retrieval-Augmented Generation (RAG) technologies. Participants from academia and industry were invited to develop a RAG-based question-answering system using a fixed corpus (Fineweb-10BT) and a common open-source LLM (Falcon3-10B-Instruct), with the goal of enabling direct comparisons of retrieval and prompting strategies. During the Live Challenge Day, 70 teams from 27 countries submitted answers and supporting evidence for 500 unseen questions within a strict two-hour window. Evaluation was conducted in two stages: first, an automated LLM-as-a-judge approach computed correctness and faithfulness scores; then, the top-ranked submissions were manually reviewed. The finalists were announced on June 12, 2025, with prizes awarded during the LiveRAG Workshop at SIGIR 2025 in Padua, Italy.
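The two-stage evaluation described above can be sketched as follows. This is a minimal illustration, not the challenge's actual protocol: the judge prompt, the 0-2 score scale, the reply format, and the ranking key are all assumptions made for the example, and the call to a judge model is left out.

```python
import re

# Hypothetical judge prompt (illustrative; not the challenge's actual rubric).
JUDGE_PROMPT = """You are an impartial judge. Given a question, a system answer,
and the passages the system cited as support, rate:
- correctness (0-2): does the answer accurately address the question?
- faithfulness (0-2): is the answer grounded in the cited passages?
Question: {question}
Answer: {answer}
Passages: {passages}
Reply exactly as: correctness=<n> faithfulness=<n>"""


def parse_judge_reply(reply: str) -> dict:
    """Stage 1 helper: extract the two scores from a judge model's reply."""
    m = re.search(r"correctness=(\d)\s+faithfulness=(\d)", reply)
    if not m:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return {"correctness": int(m.group(1)), "faithfulness": int(m.group(2))}


def rank_teams(judged: dict) -> list:
    """Rank teams by mean correctness, tie-broken by mean faithfulness.
    Stage 2 (manual review of the top of this ranking) happens offline.
    `judged` maps team name -> list of per-question score dicts."""
    def key(team):
        scores = judged[team]
        n = len(scores)
        return (sum(s["correctness"] for s in scores) / n,
                sum(s["faithfulness"] for s in scores) / n)
    return sorted(judged, key=key, reverse=True)
```

In this sketch the automated stage produces a full ranking, and the human-verification stage only needs to inspect the head of that list, which keeps the manual effort bounded even with 70 participating teams.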