Building Domain-Specific Small Language Models via Guided Data Generation

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address three key challenges in deploying large language models (LLMs) for domain-specific SaaS applications (data privacy risks, the high cost of adapting open-source models, and the scarcity of high-quality annotated data), this paper proposes a low-resource paradigm for building small domain-specific language models (SLMs). Methodologically, it combines guided synthetic data generation with bottom-up curation of real domain data, forming a “seed-corpus-guided generation → hierarchical data collection → three-stage training” framework. Training proceeds through domain-adaptive pretraining (DAPT), domain-specific supervised fine-tuning (DSFT), and direct preference optimization (DPO). Evaluated on industrial fault diagnosis, the 3B-parameter DiagnosticSLM substantially outperforms open-source baselines (2B–9B parameters), achieving up to 25% accuracy improvement on the multiple-choice question (MCQ) benchmark while matching or exceeding them on question answering, sentence completion, and summarization. This demonstrates effective domain-specific reasoning and generalization under resource-constrained settings.

📝 Abstract
Large Language Models (LLMs) have shown remarkable success in supporting a wide range of knowledge-intensive tasks. In specialized domains, there is growing interest in leveraging LLMs to assist subject matter experts with domain-specific challenges. However, deploying LLMs as SaaS solutions raises data privacy concerns, while many open-source models demand significant computational resources for effective domain adaptation and deployment. A promising alternative is to develop smaller, domain-specialized LLMs, though this approach is often constrained by the lack of high-quality domain-specific training data. In this work, we address these limitations by presenting a cost-efficient and scalable training pipeline that combines guided synthetic data generation from a small seed corpus with bottom-up domain data curation. Our pipeline integrates Domain-Adaptive Pretraining (DAPT), Domain-specific Supervised Fine-tuning (DSFT), and Direct Preference Optimization (DPO) to train effective small-scale models for specialized use cases. We demonstrate this approach through DiagnosticSLM, a 3B-parameter domain-specific model tailored for fault diagnosis, root cause analysis, and repair recommendation in industrial settings. To evaluate model performance, we introduce four domain-specific benchmarks: multiple-choice questions (DiagnosticMCQ), question answering (DiagnosticQA), sentence completion (DiagnosticComp), and summarization (DiagnosticSum). DiagnosticSLM achieves up to 25% accuracy improvement over open-source models of comparable or larger size (2B-9B) on the MCQ task, while also outperforming or matching them in other tasks, demonstrating effective domain-specific reasoning and generalization capabilities.
Problem

Research questions and friction points this paper is trying to address.

Develops small domain-specific LLMs for specialized tasks
Addresses data privacy and computational resource limitations
Generates high-quality domain data for effective model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Guided synthetic data generation from seed corpus
Domain-Adaptive Pretraining and Supervised Fine-tuning pipeline
Small-scale model training with Direct Preference Optimization
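The final training stage, Direct Preference Optimization, fine-tunes the model directly on preference pairs without a separate reward model. As a minimal sketch (not the paper's implementation), the per-pair DPO loss can be computed from four log-probabilities; the `beta` value and the example log-probs below are illustrative assumptions:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the summed log-probability of a response under the
    trainable policy (logp_*) or the frozen reference model (ref_*);
    beta scales the implicit reward margin.
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # -log(sigmoid(margin)), written via log1p(exp(-x)) for clarity
    return math.log1p(math.exp(-margin))

# When the policy favors the chosen response more than the reference does,
# the loss drops below log(2), the value at zero margin.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))  # policy prefers chosen: lower loss
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))  # no preference signal: log(2)
```

Minimizing this loss pushes the policy to assign relatively higher likelihood to preferred responses while the reference model anchors it against drifting too far from the DSFT checkpoint.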
Aman Kumar
Hitachi America Ltd., Santa Clara, CA, USA
Ekant Muljibhai Amin
Hitachi Ltd., Tokyo, Japan
Xian Yeow Lee
Hitachi America Ltd.
Lasitha Vidyaratne
Hitachi America Ltd., Santa Clara, CA, USA
Ahmed K. Farahat
Hitachi America Ltd., Santa Clara, CA, USA
Dipanjan D. Ghosh
Hitachi America Ltd., Santa Clara, CA, USA
Yuta Koreeda
Hitachi, Ltd., Hitachi America, Ltd., Stanford CS
Chetan Gupta
Hitachi America Ltd., Santa Clara, CA, USA