MHTS: Multi-Hop Tree Structure Framework for Generating Difficulty-Controllable QA Datasets for RAG Evaluation

📅 2025-03-29
🏛️ arXiv.org
🤖 AI Summary
Existing RAG benchmarks largely overlook query difficulty, leading to inflated and unreliable evaluations. Robust assessment necessitates jointly considering answer quality, response diversity, and query difficulty. Method: We propose a fine-grained difficulty modeling framework based on multi-hop tree structures. Specifically: (1) we design a logically coherent multi-step query synthesis mechanism; (2) we formulate a difficulty metric integrating evidence distribution and reasoning depth; and (3) we establish a controllable data synthesis pipeline enabling difficulty-stratified dataset generation. Contribution/Results: To our knowledge, this is the first work to introduce a difficulty estimation algorithm that jointly evaluates retrieval and generation capabilities within RAG. Experiments show strong correlation (r > 0.85) between our estimated query difficulty and end-to-end RAG performance, significantly enhancing evaluation robustness and interpretability.

📝 Abstract
Existing RAG benchmarks often overlook query difficulty, leading to inflated performance on simpler questions and unreliable evaluations. A robust benchmark dataset must satisfy three key criteria: quality, diversity, and difficulty, where difficulty captures the complexity of reasoning in terms of the number of hops and the distribution of supporting evidence. In this paper, we propose MHTS (Multi-Hop Tree Structure), a novel dataset synthesis framework that systematically controls multi-hop reasoning complexity by leveraging a multi-hop tree structure to generate logically connected, multi-chunk queries. Our fine-grained difficulty estimation formula exhibits a strong correlation with the overall performance metrics of a RAG system, validating its effectiveness in assessing both retrieval and answer generation capabilities. By ensuring high-quality, diverse, and difficulty-controlled queries, our approach enhances RAG evaluation and benchmarking capabilities.
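
The correlation claim above can be checked with a plain Pearson coefficient between per-query difficulty scores and a per-query performance metric. The sketch below is only an illustration of that check: the function name `pearson_r` and every number in it are made up for the example and are not taken from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one difficulty score and one accuracy value per
# difficulty bucket. These values are invented for illustration only.
difficulty = [0.1, 0.3, 0.5, 0.7, 0.9]
accuracy = [0.92, 0.85, 0.70, 0.55, 0.40]

r = pearson_r(difficulty, accuracy)
print(f"r = {r:.3f}")
```

With an accuracy-style metric, a useful difficulty score should yield a strongly negative coefficient (harder queries, lower scores); with an error-style metric the sign flips, so the magnitude is what matters when comparing against a reported threshold.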
Problem

Research questions and friction points this paper is trying to address.

Existing RAG benchmarks lack query difficulty control
Need datasets with quality, diversity, and difficulty criteria
Propose MHTS framework for controllable multi-hop reasoning complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-hop tree structure controls reasoning complexity
Fine-grained difficulty formula correlates with performance
Generates diverse, quality, difficulty-controlled queries
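
One way to picture a multi-hop tree is as leaves holding claims drawn from individual document chunks, with each internal node merging its children into a more complex claim. The sketch below is a toy model under that assumption. `HopNode`, `toy_difficulty`, and the weights are hypothetical names and values chosen for illustration; the summary does not give the paper's actual difficulty formula, and this is not it.

```python
from dataclasses import dataclass, field


@dataclass
class HopNode:
    """A node in a hypothetical multi-hop tree (illustrative only)."""
    claim: str
    chunk_ids: set = field(default_factory=set)   # evidence chunks (leaves)
    children: list = field(default_factory=list)  # claims merged into this one


def depth(node):
    """Number of merge steps (hops) between this node and its deepest leaf."""
    if not node.children:
        return 0
    return 1 + max(depth(child) for child in node.children)


def evidence_chunks(node):
    """All distinct chunks that support the claim at this node."""
    chunks = set(node.chunk_ids)
    for child in node.children:
        chunks |= evidence_chunks(child)
    return chunks


def toy_difficulty(node, w_depth=1.0, w_spread=0.5):
    """Assumed stand-in score: weighted reasoning depth plus evidence spread."""
    return w_depth * depth(node) + w_spread * len(evidence_chunks(node))


# Three single-chunk claims merged in two steps.
leaf_a = HopNode("claim A", chunk_ids={"c1"})
leaf_b = HopNode("claim B", chunk_ids={"c2"})
leaf_c = HopNode("claim C", chunk_ids={"c3"})
mid = HopNode("A combined with B", children=[leaf_a, leaf_b])
root = HopNode("(A combined with B) combined with C", children=[mid, leaf_c])

print(depth(root), sorted(evidence_chunks(root)), toy_difficulty(root))
```

Under this toy score, a query generated from `root` rates as harder than one generated from `mid` because it sits one hop deeper and draws on one more chunk, which is the kind of ordering a difficulty-stratified dataset needs.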

Authors

Jeongsoo Lee
DATUMO
Daeyong Kwon
DATUMO, Graduate School of Culture Technology, Korea Advanced Institute of Science & Technology
Kyohoon Jin
DATUMO
Natural Language Processing
Junnyeong Jeong
DATUMO
Minwoo Sim
DATUMO
Minwoo Kim
DATUMO