🤖 AI Summary
Generating high-quality open-domain question-answer pairs for specialized domains (e.g., railway safety regulations) remains difficult: existing approaches underuse expert demonstrations and struggle to balance thematic diversity. This paper proposes the first generative protocol to integrate few-shot learning with structured topic and style classification. We empirically uncover a systematic surface-level stylistic bias in large language model (LLM) evaluators, while demonstrating our method's superior fidelity in preserving cognitive-complexity distributions. Our method incorporates Bloom's taxonomy-based cognitive hierarchy analysis, domain adaptation grounded in authentic regulatory documents, and a retrieval-augmented evaluation framework. Experiments show that our approach achieves twice the generation efficiency of baseline methods and attains 94.4% topic coverage. Fine-tuning a retrieval model on the generated queries boosts top-1 accuracy by 13.02%, significantly enhancing downstream task performance.
📝 Abstract
Generating high-quality question-answer pairs for specialized technical domains remains challenging, with existing approaches facing a tradeoff between leveraging expert examples and achieving topical diversity. We present ExpertGenQA, a protocol that combines few-shot learning with structured topic and style categorization to generate comprehensive domain-specific QA pairs. Using U.S. Federal Railroad Administration documents as a test bed, we demonstrate that ExpertGenQA achieves twice the efficiency of baseline few-shot approaches while maintaining 94.4% topic coverage. Through systematic evaluation, we show that current LLM-based judges and reward models exhibit strong bias toward superficial writing styles rather than content quality. Our analysis using Bloom's Taxonomy reveals that ExpertGenQA better preserves the cognitive complexity distribution of expert-written questions compared to template-based approaches. When used to train retrieval models, our generated queries improve top-1 accuracy by 13.02% over baseline performance, demonstrating their effectiveness for downstream applications in technical domains.
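The top-1 accuracy figure above refers to how often a retrieval model ranks the correct document first for a given query. As a minimal sketch (not the authors' evaluation code), assuming dense embeddings scored by cosine similarity, the metric can be computed like this; the function name and toy data are hypothetical:

```python
import numpy as np

def top1_accuracy(query_vecs: np.ndarray, doc_vecs: np.ndarray, gold: list[int]) -> float:
    """Fraction of queries whose highest-scoring document is the gold one.

    query_vecs: (n_queries, d) query embeddings
    doc_vecs:   (n_docs, d) document embeddings
    gold[i]:    index of the correct document for query i
    """
    # Cosine similarity = dot product of L2-normalized vectors.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    best = (q @ d.T).argmax(axis=1)  # best-matching doc index per query
    return float((best == np.asarray(gold)).mean())

# Toy example: 3 queries, 4 documents, 2-d embeddings (illustrative only).
queries = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
docs = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.7], [-1.0, 0.0]])
print(top1_accuracy(queries, docs, gold=[0, 1, 2]))  # → 1.0
```

The reported 13.02% gain is the difference in this metric before and after fine-tuning the retriever on the generated queries.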