🤖 AI Summary
Existing RAG benchmarks largely overlook query difficulty, leading to inflated and unreliable evaluations. Robust assessment necessitates jointly considering answer quality, response diversity, and query difficulty.
Method: We propose a fine-grained difficulty modeling framework based on multi-hop tree structures. Specifically: (1) we design a logically coherent multi-step query synthesis mechanism; (2) we formulate a difficulty metric integrating evidence distribution and reasoning depth; and (3) we establish a controllable data synthesis pipeline enabling difficulty-stratified dataset generation.
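The paper does not reproduce the difficulty formula here, but a minimal sketch of the idea in step (2) might look like the following, where the score grows with reasoning depth and with how widely the supporting evidence is scattered across the corpus. All names, the hop cap, and the weights are illustrative assumptions, not the paper's actual formula.

```python
def estimate_difficulty(hop_depth: int,
                        evidence_positions: list[int],
                        num_chunks: int) -> float:
    """Illustrative difficulty score in [0, 1]: deeper multi-hop reasoning
    and more widely dispersed evidence chunks yield a harder query.
    This is a hypothetical stand-in, not the formula from the paper."""
    # Reasoning-depth term: normalize hop depth against an assumed cap of 4 hops.
    depth_term = min(hop_depth, 4) / 4
    # Evidence-dispersion term: how far apart the supporting chunks sit
    # in the document, normalized by corpus length.
    if len(evidence_positions) > 1 and num_chunks > 1:
        spread = (max(evidence_positions) - min(evidence_positions)) / (num_chunks - 1)
    else:
        spread = 0.0
    # Weighted combination; the 0.6/0.4 split is an arbitrary illustration.
    return 0.6 * depth_term + 0.4 * spread
```

Under this sketch, a single-hop query with one evidence chunk scores low, while a 4-hop query whose evidence spans the whole corpus scores near 1.0, which is the kind of stratification the synthesis pipeline in step (3) would target.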
Contribution/Results: To our knowledge, this is the first work to introduce a difficulty estimation algorithm that jointly evaluates retrieval and generation capabilities within RAG. Experiments show strong correlation (r > 0.85) between our estimated query difficulty and end-to-end RAG performance, significantly enhancing evaluation robustness and interpretability.
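The reported r > 0.85 is a Pearson correlation between per-query estimated difficulty and end-to-end RAG performance. A self-contained sketch of that validation step (the helper name and sample data are assumptions for illustration):

```python
def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical check: harder queries should see lower RAG accuracy,
# so difficulty and performance correlate strongly and negatively
# (equivalently, |r| > 0.85 in the paper's framing).
difficulty = [0.2, 0.4, 0.6, 0.8]   # estimated per-query difficulty (made up)
accuracy   = [0.9, 0.7, 0.5, 0.3]   # end-to-end RAG accuracy (made up)
r = pearson_r(difficulty, accuracy)
```

In practice one would compute this over difficulty-stratified buckets of synthesized queries rather than four toy points.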
📝 Abstract
Existing RAG benchmarks often overlook query difficulty, leading to inflated performance on simpler questions and unreliable evaluations. A robust benchmark dataset must satisfy three key criteria: quality, diversity, and difficulty, where difficulty captures both the complexity of multi-hop reasoning and the distribution of supporting evidence. In this paper, we propose MHTS (Multi-Hop Tree Structure), a novel dataset synthesis framework that systematically controls multi-hop reasoning complexity by leveraging a multi-hop tree structure to generate logically connected, multi-chunk queries. Our fine-grained difficulty estimation formula exhibits a strong correlation with the overall performance metrics of a RAG system, validating its effectiveness in assessing both retrieval and answer generation capabilities. By ensuring high-quality, diverse, and difficulty-controlled queries, our approach enhances RAG evaluation and benchmarking capabilities.