Building Domain-Specific Small Language Models via Guided Data Generation

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address three key challenges in deploying large language models (LLMs) for domain-specific SaaS applications (data privacy risks, the high cost of adapting open-source models, and the scarcity of high-quality annotated data), this paper proposes a low-resource paradigm for building small domain-specific language models (SLMs). Methodologically, it combines guided synthetic data generation with bottom-up curation of real domain data, forming a “seed-corpus-guided generation → hierarchical data collection → three-stage training” framework. Training proceeds through domain-adaptive pretraining (DAPT), domain-specific supervised fine-tuning (DSFT), and direct preference optimization (DPO). Evaluated on industrial fault diagnosis, the 3B-parameter DiagnosticSLM substantially outperforms open-source baselines (2B–9B parameters), achieving up to 25% accuracy improvement on the multiple-choice question (MCQ) benchmark while matching or exceeding them on question answering, sentence completion, and summarization. This demonstrates effective domain-specific reasoning and generalization under resource-constrained settings.

📝 Abstract
Large Language Models (LLMs) have shown remarkable success in supporting a wide range of knowledge-intensive tasks. In specialized domains, there is growing interest in leveraging LLMs to assist subject matter experts with domain-specific challenges. However, deploying LLMs as SaaS solutions raises data privacy concerns, while many open-source models demand significant computational resources for effective domain adaptation and deployment. A promising alternative is to develop smaller, domain-specialized LLMs, though this approach is often constrained by the lack of high-quality domain-specific training data. In this work, we address these limitations by presenting a cost-efficient and scalable training pipeline that combines guided synthetic data generation from a small seed corpus with bottom-up domain data curation. Our pipeline integrates Domain-Adaptive Pretraining (DAPT), Domain-specific Supervised Fine-tuning (DSFT), and Direct Preference Optimization (DPO) to train effective small-scale models for specialized use cases. We demonstrate this approach through DiagnosticSLM, a 3B-parameter domain-specific model tailored for fault diagnosis, root cause analysis, and repair recommendation in industrial settings. To evaluate model performance, we introduce four domain-specific benchmarks: multiple-choice questions (DiagnosticMCQ), question answering (DiagnosticQA), sentence completion (DiagnosticComp), and summarization (DiagnosticSum). DiagnosticSLM achieves up to 25% accuracy improvement over open-source models of comparable or larger size (2B-9B) on the MCQ task, while also outperforming or matching them in other tasks, demonstrating effective domain-specific reasoning and generalization capabilities.
Problem

Research questions and friction points this paper is trying to address.

Develops small domain-specific LLMs for specialized tasks
Addresses data privacy and computational resource limitations
Generates high-quality domain data for effective model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Guided synthetic data generation from seed corpus
Domain-Adaptive Pretraining and Supervised Fine-tuning pipeline
Small-scale model training with Direct Preference Optimization
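The final training stage, Direct Preference Optimization, fine-tunes the model directly on preference pairs without a separate reward model. As a minimal sketch (not the paper's implementation), the per-pair DPO loss can be computed from four log-probabilities; the `beta` value and the example log-probs below are illustrative assumptions:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the summed log-probability of a response under the
    trainable policy (logp_*) or the frozen reference model (ref_*);
    beta scales the implicit reward margin.
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # -log(sigmoid(margin)), written via log1p(exp(-x)) for clarity
    return math.log1p(math.exp(-margin))

# When the policy favors the chosen response more than the reference does,
# the loss drops below log(2), the value at zero margin.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))  # policy prefers chosen: lower loss
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))  # no preference signal: log(2)
```

Minimizing this loss pushes the policy to assign relatively higher likelihood to preferred responses while the reference model anchors it against drifting too far from the DSFT checkpoint.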
Aman Kumar
Hitachi America Ltd., Santa Clara, CA, USA
Ekant Muljibhai Amin
Hitachi Ltd., Tokyo, Japan
Xian Yeow Lee
Hitachi America Ltd.
Lasitha Vidyaratne
Hitachi America Ltd., Santa Clara, CA, USA
Ahmed K. Farahat
Hitachi America Ltd., Santa Clara, CA, USA
Dipanjan D. Ghosh
Hitachi America Ltd., Santa Clara, CA, USA
Yuta Koreeda
Hitachi, Ltd., Hitachi America, Ltd., Stanford CS
Chetan Gupta
Hitachi America Ltd., Santa Clara, CA, USA