AILS-NTUA at SemEval-2026 Task 8: Evaluating Multi-Turn RAG Conversations

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the three subtasks of multi-turn retrieval-augmented generation (RAG) in SemEval-2026 Task 8: passage retrieval, reference-grounded response generation, and end-to-end evaluation, via a unified architecture. Instead of ensembling heterogeneous retrievers, it issues diverse LLM-generated query rewrites to a single corpus-aligned sparse retriever and fuses the resulting rankings with variance-aware nested Reciprocal Rank Fusion. Generation is decomposed into a multi-stage pipeline: evidence span extraction, dual-candidate drafting, and calibrated multi-judge selection. The method achieves state-of-the-art performance on Task A with an nDCG@5 of 0.5776, surpassing the strongest baseline by 20.5%, and ranks second on Task B with a harmonic mean (HM) of 0.7698, demonstrating gains in both retrieval and generation for multi-turn RAG.
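The three-stage generation pipeline (evidence extraction, dual drafts, multi-judge selection) can be sketched as a minimal Python skeleton. Note this is an illustrative assumption, not the paper's code: `call_llm` and `judge` are hypothetical stubs standing in for real model calls, and the prompts are placeholders.

```python
def call_llm(prompt):
    # Stub standing in for any LLM API call; returns the prompt as a placeholder.
    return prompt

def judge(question, drafts):
    # Stub judge: a real implementation would score each draft for grounding
    # and answerability; here it always votes for draft 0.
    return 0

def generate_response(question, passages):
    # Stage 1: extract evidence spans from each retrieved passage.
    evidence = [call_llm(f"Extract spans answering {question!r} from: {p}")
                for p in passages]
    # Stage 2: draft two candidate answers from the pooled evidence.
    drafts = [call_llm(f"Draft {i}: answer {question!r} using {evidence}")
              for i in (1, 2)]
    # Stage 3: three judges each vote for a draft index; return the
    # majority-voted draft.
    votes = [judge(question, drafts) for _ in range(3)]
    return drafts[max(set(votes), key=votes.count)]
```

The separation of stages mirrors the paper's decomposition: grounding errors surface at extraction time, while the judge stage handles calibration between candidates.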

📝 Abstract
We present the AILS-NTUA system for SemEval-2026 Task 8 (MTRAGEval), addressing all three subtasks of multi-turn retrieval-augmented generation: passage retrieval (A), reference-grounded response generation (B), and end-to-end RAG (C). Our unified architecture is built on two principles: (i) a query-diversity-over-retriever-diversity strategy, where five complementary LLM-based query reformulations are issued to a single corpus-aligned sparse retriever and fused via variance-aware nested Reciprocal Rank Fusion; and (ii) a multistage generation pipeline that decomposes grounded generation into evidence span extraction, dual-candidate drafting, and calibrated multi-judge selection. Our system ranks 1st in Task A (nDCG@5: 0.5776, +20.5% over the strongest baseline) and 2nd in Task B (HM: 0.7698). Empirical analysis shows that query diversity over a well-aligned retriever outperforms heterogeneous retriever ensembling, and that answerability calibration, rather than retrieval coverage, is the primary bottleneck in end-to-end performance.
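The fusion step in principle (i) builds on Reciprocal Rank Fusion. The sketch below shows standard RRF (Cormack et al.) applied to rankings from multiple query rewrites against one retriever; the paper's variance-aware nested variant adds weighting details not reproduced here.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists (each a list of doc ids, best first) with
    standard Reciprocal Rank Fusion: score(d) = sum over lists of
    1 / (k + rank of d in that list)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Higher fused score = better; return doc ids in fused order.
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: three query rewrites, each producing a ranking from the
# same sparse retriever. d2 appears highly in all three lists, so it
# wins the fusion.
rewrites = [["d1", "d2", "d3"],
            ["d2", "d1", "d4"],
            ["d3", "d2", "d5"]]
print(rrf_fuse(rewrites)[:3])  # → ['d2', 'd1', 'd3']
```

The constant `k` (conventionally 60) damps the influence of top ranks so that a document ranked moderately in many lists can outscore one ranked first in a single list.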
Problem

Research questions and friction points this paper is trying to address.

multi-turn RAG
passage retrieval
reference-grounded response generation
end-to-end RAG
retrieval-augmented generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

query diversity
sparse retriever
Reciprocal Rank Fusion
multistage generation pipeline
answerability calibration