H-RAG at SemEval-2026 Task 8: Hierarchical Parent-Child Retrieval for Multi-Turn RAG Conversations

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of retrieval-augmented generation (RAG) in multi-turn dialogue: insufficient retrieval precision and poor generation faithfulness. The authors propose a hierarchical parent-child retrieval framework that splits documents into overlapping sentence-based child chunks for fine-grained retrieval while preserving the full documents as parent units that supply coherent context. Retrieval combines hybrid dense-sparse search with tunable weighting and embedding-based similarity rescoring; the retrieved evidence is then aggregated at the parent level and passed to an instruction-tuned language model for generation. On SemEval-2026 Task 8 (MTRAGEval), the method achieves an nDCG@5 of 0.4271 on Task A and a harmonic mean score of 0.3241 on Task C (RB_agg: 0.2488, RL_F: 0.2703, RB_llm: 0.6508), which the authors take as evidence that retrieval configuration and parent-level aggregation matter for multi-turn RAG performance.
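The Task C figure can be checked from its three components. The abstract describes it as a harmonic mean; the short sketch below assumes an unweighted harmonic mean over RB_agg, RL_F, and RB_llm, which reproduces the reported value to four decimal places. The variable names are illustrative only.

```python
from statistics import harmonic_mean

# Component scores as reported for Task C.
scores = {"RB_agg": 0.2488, "RL_F": 0.2703, "RB_llm": 0.6508}

# Assumption: an unweighted harmonic mean of the three components.
task_c = harmonic_mean(list(scores.values()))
print(round(task_c, 4))
```

Because a harmonic mean is pulled toward its smallest inputs, the two lower components (RB_agg and RL_F) dominate the result even though RB_llm is much higher.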

📝 Abstract

We present H-RAG, our submission to SemEval-2026 Task 8 (MTRAGEval), addressing both Task A (Retrieval) and Task C (Generation with Retrieved Passages). Task A evaluates standalone retrieval quality, while Task C assesses end-to-end retrieval-augmented generation (RAG) in multi-turn conversational settings, requiring both accurate answer generation and faithful grounding in retrieved evidence. Our approach implements a hierarchical parent-child RAG pipeline that separates fine-grained child-level retrieval from parent-level context reconstruction during generation. Documents are segmented into overlapping sentence-based child chunks, while full documents are preserved as parent units to provide coherent context. Retrieval combines hybrid dense-sparse search, tunable weighting, and embedding-based similarity rescoring over child chunks. Retrieved evidence is aggregated at the parent level and supplied to an instruction-tuned language model for response generation. H-RAG achieves an nDCG@5 score of 0.4271 on Task A and a harmonic mean score of 0.3241 on Task C (RB_agg: 0.2488, RL_F: 0.2703, RB_llm: 0.6508), underscoring the importance of retrieval configuration and parent-level aggregation in multi-turn RAG performance.
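The parent-child idea described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the sparse scorer is a simple term-overlap stand-in for a lexical retriever such as BM25, the "dense" scorer is a character-trigram cosine standing in for a neural encoder, the separate rescoring step is omitted, and all function names, the window size, the stride, and the weight `alpha` are hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def split_children(doc_id, text, window=2, stride=1):
    """Split one parent document into overlapping sentence-window child chunks."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    last_start = max(len(sentences) - window, 0)
    return [
        {"parent": doc_id, "text": " ".join(sentences[i:i + window])}
        for i in range(0, last_start + 1, stride)
    ]

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def sparse_score(query, chunk):
    """Toy lexical score: fraction of query terms found in the chunk."""
    q = set(tokens(query))
    return len(q & set(tokens(chunk))) / len(q) if q else 0.0

def trigram_vector(text):
    """Toy 'embedding': character-trigram counts (stand-in for a neural encoder)."""
    s = " ".join(tokens(text))
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def dense_score(query, chunk):
    """Cosine similarity between the toy trigram vectors."""
    a, b = trigram_vector(query), trigram_vector(chunk)
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_parents(query, documents, alpha=0.5, top_k=2):
    """Score child chunks with a weighted dense/sparse mix, then aggregate by parent."""
    best = defaultdict(float)
    for doc_id, text in documents.items():
        for child in split_children(doc_id, text):
            score = (alpha * dense_score(query, child["text"])
                     + (1 - alpha) * sparse_score(query, child["text"]))
            # Assumption: a parent is ranked by its best-scoring child.
            best[doc_id] = max(best[doc_id], score)
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    # Return full parent documents, not the child chunks, as generation context.
    return [(doc_id, documents[doc_id]) for doc_id, _ in ranked[:top_k]]

documents = {
    "d1": "Paris is the capital of France. It lies on the Seine. The Louvre is a museum in Paris.",
    "d2": "Berlin is the capital of Germany. It lies on the Spree. Berlin has many museums.",
}
top = retrieve_parents("Which river runs through Berlin?", documents, top_k=1)
print(top[0][0])
```

The design point is the split between the two granularities: matching happens on small child chunks, where a query is less diluted by unrelated text, while the generator receives the whole parent so the answer is grounded in coherent context. Max-over-children is only one possible aggregation rule; summing or averaging child scores would be equally plausible under the abstract's description.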

Problem

Research questions and friction points this paper is trying to address.

multi-turn RAG

retrieval-augmented generation

conversational retrieval

evidence grounding

retrieval quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical RAG

Parent-Child Retrieval

Multi-turn Conversations