🤖 AI Summary
High-quality multi-hop question answering (QA) datasets are scarce, especially those that simultaneously exhibit retrieval robustness—requiring integration of ambiguous, cross-domain clues across multiple hops—and answer verifiability; manual construction is prohibitively expensive and unscalable.
Method: We propose a bottom-up automated generation framework that (1) constructs logically coherent evidence clusters from semi-structured data via natural language inference (NLI)-based relation labeling and diversity augmentation; (2) applies reverse question generation to ensure single-hop unanswerability and multi-hop answer uniqueness; and (3) enforces quality with a two-stage pipeline integrating NLI validation, multi-model consensus filtering, structured constraint decomposition, and evidence alignment.
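The cluster-construction step (1) can be illustrated with a minimal sketch. The paper's actual NLI model and relation taxonomy are not specified here, so this uses a hypothetical token-overlap stub in place of a real NLI classifier, and a simple "new tokens" criterion as a stand-in for diversity augmentation; both are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field


def nli_label(premise: str, hypothesis: str) -> str:
    """Hypothetical stand-in for a real NLI model: a crude
    token-overlap heuristic with a negation check."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    overlap = len(p & h) / max(len(h), 1)
    negated = ("not" in p) != ("not" in h)
    if overlap > 0.6:
        return "contradicts" if negated else "entails"
    return "neutral"


@dataclass
class EvidenceCluster:
    facts: list = field(default_factory=list)

    def try_add(self, candidate: str, min_new_tokens: int = 2) -> bool:
        """Grow the cluster only with facts that are logically compatible
        (no contradiction against any existing fact) and diverse enough
        (contribute at least `min_new_tokens` unseen tokens)."""
        seen = set(" ".join(self.facts).lower().split())
        new_tokens = set(candidate.lower().split()) - seen
        compatible = all(
            nli_label(f, candidate) != "contradicts" for f in self.facts
        )
        if compatible and (not self.facts or len(new_tokens) >= min_new_tokens):
            self.facts.append(candidate)
            return True
        return False
```

In a real pipeline the heuristic would be replaced by an actual NLI model, but the control flow is the same: each candidate fact is admitted only if it neither contradicts nor merely restates the cluster it joins.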
Contribution/Results: Our method enables, for the first time, large-scale, low-cost, controllable generation of high-difficulty multi-hop QA data. The resulting dataset is compatible with both supervised fine-tuning and reinforcement learning, significantly improving the training efficiency of reasoning agents and enhancing evaluation rigor.
📝 Abstract
Building training-ready multi-hop question answering (QA) datasets that truly stress a model's retrieval and reasoning abilities remains highly challenging. While a few recent evaluation datasets capture the characteristics of hard-to-search but easy-to-verify problems -- requiring the integration of ambiguous, indirect, and cross-domain cues -- these data resources remain scarce and are designed mostly for evaluation, making them unsuitable for supervised fine-tuning (SFT) or reinforcement learning (RL). Meanwhile, manually curating non-trivially retrievable questions -- where answers cannot be found through a single direct query but instead require multi-hop reasoning over oblique and loosely connected evidence -- incurs prohibitive human costs and fails to scale, creating a critical data bottleneck for training high-capability retrieval-and-reasoning agents.
To address this, we present an automated framework for generating high-difficulty, training-ready multi-hop questions from semi-structured knowledge sources. The system (i) grows diverse, logically labeled evidence clusters through Natural Language Inference (NLI)-based relation typing and diversity-aware expansion; (ii) applies reverse question construction to compose oblique cues so that isolated signals are underinformative but their combination uniquely identifies the target entity; and (iii) enforces quality with a two-step evaluation pipeline that combines multi-model consensus filtering with structured constraint decomposition and evidence-based matching. The result is a scalable process that yields complex, retrieval-resistant yet verifiable questions suitable for SFT/RL training as well as challenging evaluation, substantially reducing human curation effort while preserving the difficulty profile of strong evaluation benchmarks.
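The two-step evaluation pipeline in (iii) can be sketched as follows. The model set, voting threshold, and constraint format are assumptions for illustration; the paper's actual pipeline is not reproduced here. The sketch combines majority-vote consensus over independent model answers with a check that every decomposed atomic constraint is covered by the supporting evidence.

```python
from collections import Counter


def consensus_filter(answers_by_model: dict, min_agree: int = 2):
    """Multi-model consensus filtering: keep a question only if at least
    `min_agree` independent models converge on the same normalized answer.
    Returns the consensus answer, or None if no sufficient agreement."""
    normalized = [a.strip().lower() for a in answers_by_model.values()]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer if count >= min_agree else None


def constraints_supported(constraints: list, evidence: list) -> bool:
    """Structured constraint decomposition + evidence matching: every
    atomic constraint must appear in the pooled evidence text."""
    pooled = " ".join(evidence).lower()
    return all(c.lower() in pooled for c in constraints)


def accept(answers_by_model: dict, constraints: list, evidence: list) -> bool:
    """A question survives only if it passes both filtering stages."""
    consensus = consensus_filter(answers_by_model)
    return consensus is not None and constraints_supported(constraints, evidence)
```

A production version would normalize answers more carefully (aliases, dates, units) and use semantic rather than substring matching for evidence alignment, but the gating logic is the same: disagreement between models or an unsupported constraint discards the candidate question.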