D-SCoRE: Document-Centric Segmentation and CoT Reasoning with Structured Export for QA-CoT Data Generation

📅 2025-08-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
High-quality domain-specific question-answering (QA) data is scarce and expensive to construct, severely hindering supervised fine-tuning (SFT) of domain large language models (LLMs). To address this, we propose D-SCoRE—a training-free, prompt-engineering-based framework for automated QA-chain-of-thought (CoT) data generation. D-SCoRE integrates multiple controllable mechanisms: document-centric text segmentation, semantic role transformation, question-type balancing, and counterfactual augmentation—combined with CoT reasoning and structured output formatting. It enables efficient generation on consumer-grade hardware: an 8B-parameter model produces six four-choice counterfactual QA-CoT samples per 100–200-word input in under 90 seconds. Experiments demonstrate that models fine-tuned on D-SCoRE-generated data significantly outperform baselines on SQuADShifts and Covid-QA, achieving performance comparable to that obtained with human-annotated data.

📝 Abstract
The scarcity and high cost of high-quality question-answering (QA) datasets hinder supervised fine-tuning (SFT) for domain-specific large language models (LLMs). To address this, we introduce D-SCoRE, a training-free pipeline that utilizes LLMs and prompt engineering to produce diverse, high-quality QA datasets from arbitrary textual sources. D-SCoRE integrates **D**ocument-centric processing, **S**egmentation, **Co**T **R**easoning, and structured **E**xport to generate QA-CoT datasets tailored for domain-aware SFT. Multi-dimensional control mechanisms, such as semantic role transformation, question-type balancing, and counterfactual materials, enhance diversity and relevance, overcoming limitations of existing QA generation. LLMs fine-tuned on D-SCoRE-generated QA datasets and on human-annotated QA datasets (SQuAD, Covid-QA) are evaluated on the SQuADShifts and Covid-QA test sets, with D-SCoRE outperforming across most domains. D-SCoRE generates six QA-CoT pairs with four-option counterfactual materials per 100-200-word text in 90 seconds using an 8B LLM on consumer-grade hardware. Its simplicity and scalability enable efficient QA generation and high-performance fine-tuning across domains.
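The paper's prompts and code are not reproduced on this page, but the pipeline described above can be sketched in a few dozen lines. The following Python sketch is an assumption-laden illustration, not the authors' implementation: the segmentation rule, the prompt wording, the JSON schema, and the injected `generate(prompt) -> str` LLM call (any local 8B model behind a text-generation API would do) are all placeholders chosen to mirror the description in the abstract.

```python
import json
from typing import Callable, Iterator

def segment_document(text: str, max_words: int = 200) -> Iterator[str]:
    """Document-centric segmentation: cut the source into ~100-200-word passages.
    A fixed-size word window stands in for whatever boundary-aware rules the
    paper actually applies."""
    words = text.split()
    for start in range(0, len(words), max_words):
        yield " ".join(words[start:start + max_words])

# Hypothetical prompt combining the controls described in the abstract:
# question-type balancing, semantic-role variation, counterfactual options,
# CoT reasoning, and a structured (JSON) export format.
PROMPT_TEMPLATE = """You are creating supervised fine-tuning data from the passage below.
Write exactly 6 four-option multiple-choice questions. Balance question types
(factual, causal, comparative, ...), vary which semantic role is asked about
(agent, patient, time, location), and make the three incorrect options
counterfactual but plausible. For each item, give a step-by-step chain of
thought before the final answer. Return a JSON list of objects with the keys
"question", "options", "chain_of_thought", and "answer".

Passage:
{passage}
"""

def generate_qa_cot(passage: str, generate: Callable[[str], str]) -> list[dict]:
    """One generation step: prompt the LLM and parse its structured output.
    `generate` is any text-in/text-out call to a local or hosted model."""
    raw = generate(PROMPT_TEMPLATE.format(passage=passage))
    return json.loads(raw)  # a real pipeline would validate and retry on malformed JSON

def build_dataset(document: str, generate: Callable[[str], str], out_path: str) -> None:
    """Document -> segments -> QA-CoT samples -> JSONL export, ready for SFT."""
    with open(out_path, "w", encoding="utf-8") as f:
        for passage in segment_document(document):
            for sample in generate_qa_cot(passage, generate):
                sample["source_passage"] = passage
                f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```

Read this way, most of the framework's leverage sits in the prompt itself: the diversity controls and the CoT requirement are packed into one structured-output instruction, which is consistent with the paper's claim that no training step is needed and that an 8B model on consumer-grade hardware suffices.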
Problem

Research questions and friction points this paper is trying to address.

High-quality, domain-specific QA data is scarce and expensive to construct, hindering SFT of domain LLMs
Existing QA generation offers limited control over question diversity and relevance
Producing QA-CoT training data efficiently on consumer-grade hardware remains difficult
Innovation

Methods, ideas, or system contributions that make the work stand out.

Document-centric processing and segmentation
Chain-of-Thought reasoning for QA generation
Structured export for domain-specific datasets (one example record is sketched below)
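To make the structured-export bullet concrete, here is one QA-CoT record as it might appear in the JSONL output of the sketch above. The field names are the ones assumed there, and the example content is paraphrased from this page's own summary rather than taken from the paper's released data.

```python
# Hypothetical exported record (field names follow the sketch above).
example_record = {
    "question": "According to the passage, what mainly limits domain-specific SFT?",
    "options": [
        "A. The scarcity and cost of high-quality QA data",  # correct
        "B. The lack of open-source base models",            # counterfactual distractor
        "C. Insufficient GPU memory in data centers",        # counterfactual distractor
        "D. The absence of domain-specific tokenizers",      # counterfactual distractor
    ],
    "chain_of_thought": (
        "The passage says high-quality QA data is scarce and expensive to "
        "construct and that this hinders SFT; the other options are never "
        "mentioned, so A is the supported choice."
    ),
    "answer": "A",
    "source_passage": "High-quality domain-specific question-answering (QA) data is scarce ...",
}
```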
Weibo Zhou (City University of Hong Kong)
Lingbo Li (University of Warwick)
Shangsong Liang (Sun Yat-sen University)
Natural Language Processing · Machine Learning · Information Retrieval