AILS-NTUA at SemEval-2026 Task 8: Evaluating Multi-Turn RAG Conversations

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the three subtasks of multi-turn retrieval-augmented generation (RAG) in SemEval-2026 Task 8: passage retrieval, reference-grounded response generation, and end-to-end evaluation, via a unified architecture. Instead of ensembling heterogeneous retrievers, it issues diverse LLM-generated query rewrites to a single corpus-aligned sparse retriever and fuses the resulting rankings with variance-aware nested Reciprocal Rank Fusion. Generation is decomposed into a multi-stage pipeline: evidence span extraction, dual-candidate drafting, and calibrated multi-judge selection. The method achieves state-of-the-art performance on Task A with an nDCG@5 of 0.5776, surpassing the strongest baseline by 20.5%, and ranks second on Task B with a harmonic mean (HM) of 0.7698, demonstrating gains in both retrieval and generation for multi-turn RAG.
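The three-stage generation pipeline (evidence extraction, dual drafts, multi-judge selection) can be sketched as a minimal Python skeleton. Note this is an illustrative assumption, not the paper's code: `call_llm` and `judge` are hypothetical stubs standing in for real model calls, and the prompts are placeholders.

```python
def call_llm(prompt):
    # Stub standing in for any LLM API call; returns the prompt as a placeholder.
    return prompt

def judge(question, drafts):
    # Stub judge: a real implementation would score each draft for grounding
    # and answerability; here it always votes for draft 0.
    return 0

def generate_response(question, passages):
    # Stage 1: extract evidence spans from each retrieved passage.
    evidence = [call_llm(f"Extract spans answering {question!r} from: {p}")
                for p in passages]
    # Stage 2: draft two candidate answers from the pooled evidence.
    drafts = [call_llm(f"Draft {i}: answer {question!r} using {evidence}")
              for i in (1, 2)]
    # Stage 3: three judges each vote for a draft index; return the
    # majority-voted draft.
    votes = [judge(question, drafts) for _ in range(3)]
    return drafts[max(set(votes), key=votes.count)]
```

The separation of stages mirrors the paper's decomposition: grounding errors surface at extraction time, while the judge stage handles calibration between candidates.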

📝 Abstract
We present the AILS-NTUA system for SemEval-2026 Task 8 (MTRAGEval), addressing all three subtasks of multi-turn retrieval-augmented generation: passage retrieval (A), reference-grounded response generation (B), and end-to-end RAG (C). Our unified architecture is built on two principles: (i) a query-diversity-over-retriever-diversity strategy, where five complementary LLM-based query reformulations are issued to a single corpus-aligned sparse retriever and fused via variance-aware nested Reciprocal Rank Fusion; and (ii) a multistage generation pipeline that decomposes grounded generation into evidence span extraction, dual-candidate drafting, and calibrated multi-judge selection. Our system ranks 1st in Task A (nDCG@5: 0.5776, +20.5% over the strongest baseline) and 2nd in Task B (HM: 0.7698). Empirical analysis shows that query diversity over a well-aligned retriever outperforms heterogeneous retriever ensembling, and that answerability calibration, rather than retrieval coverage, is the primary bottleneck in end-to-end performance.
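The fusion step in principle (i) builds on Reciprocal Rank Fusion. The sketch below shows standard RRF (Cormack et al.) applied to rankings from multiple query rewrites against one retriever; the paper's variance-aware nested variant adds weighting details not reproduced here.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists (each a list of doc ids, best first) with
    standard Reciprocal Rank Fusion: score(d) = sum over lists of
    1 / (k + rank of d in that list)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Higher fused score = better; return doc ids in fused order.
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: three query rewrites, each producing a ranking from the
# same sparse retriever. d2 appears highly in all three lists, so it
# wins the fusion.
rewrites = [["d1", "d2", "d3"],
            ["d2", "d1", "d4"],
            ["d3", "d2", "d5"]]
print(rrf_fuse(rewrites)[:3])  # → ['d2', 'd1', 'd3']
```

The constant `k` (conventionally 60) damps the influence of top ranks so that a document ranked moderately in many lists can outscore one ranked first in a single list.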
Problem

Research questions and friction points this paper is trying to address.

multi-turn RAG
passage retrieval
reference-grounded response generation
end-to-end RAG
retrieval-augmented generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

query diversity
sparse retriever
Reciprocal Rank Fusion
multistage generation pipeline
answerability calibration