Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of synthesizing domain-specific data from a small set of reference samples when no explicit natural language description of the domain is available. To this end, the authors propose DOMINO, a framework that takes an inductive approach to data synthesis. DOMINO combines prompt tuning with a contrastive disentanglement objective to extract a minimal sufficient representation of the domain directly from the references, enabling large language models to generate data that is both domain-consistent and diverse, without requiring human-provided prompts or explicit domain definitions. Theoretical analysis shows that using this learned representation expands the support of the synthesized data distribution. Empirical results on code generation benchmarks with implicitly defined domains show that fine-tuning on DOMINO-synthesized data improves Pass@1 accuracy by up to 4.63% over strong instruction-tuned backbones.
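For readers unfamiliar with the Pass@1 metric reported above, the following is a minimal sketch of the widely used unbiased pass@k estimator for code generation. Whether the authors use this exact estimator is an assumption, since the summary does not say; the function name `pass_at_k` and the sample counts are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all unit tests
    k: evaluation budget (k = 1 gives Pass@1)
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    # 1 minus the probability that a random size-k subset has no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 10 samples drawn, 3 of them pass.
score = pass_at_k(n=10, c=3, k=1)  # reduces to c / n when k = 1
```

With k = 1 the estimator reduces to the fraction of generated samples that pass, averaged over problems in a benchmark.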

📝 Abstract

Large Language Models have demonstrated remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains through fine-tuning on domain-specific data. However, acquiring high-quality data for target domains remains a significant challenge. Existing data synthesis approaches follow a deductive paradigm, heavily relying on explicit domain descriptions expressed in natural language and careful prompt engineering, limiting their applicability in real-world scenarios where domains are difficult to describe or formally articulate. In this work, we tackle the underexplored problem of domain-specific data synthesis through an inductive paradigm, where the target domain is defined only through a set of reference examples, particularly when domain characteristics are difficult to articulate in natural language. We propose a novel framework, DOMINO, that learns a minimal sufficient domain representation from reference samples and leverages it to guide the generation of domain-aligned synthetic data. DOMINO integrates prompt tuning with a contrastive disentanglement objective to separate domain-level patterns from sample-specific noise, mitigating overfitting while preserving core domain characteristics. Theoretically, we prove that DOMINO expands the support of the synthetic data distribution, ensuring greater diversity. Empirically, on challenging coding benchmarks where domain definitions are implicit, fine-tuning on data synthesized by DOMINO improves Pass@1 accuracy by up to 4.63% over strong, instruction-tuned backbones, demonstrating its effectiveness and robustness. This work establishes a new paradigm for domain-specific data synthesis, enabling practical and scalable domain adaptation without manual prompt design or natural language domain specifications.
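To give a feel for what a contrastive objective does in this setting, here is a minimal sketch of a generic InfoNCE-style loss in plain Python. It is not the DOMINO objective, which the abstract does not specify; the function names, the toy 2-D vectors, and the temperature value are all assumptions made for illustration. The idea shown is only the general one: a shared domain vector is pulled toward embeddings of in-domain samples and pushed away from off-domain ones.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: -log softmax score of the positive pair."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# Toy, hypothetical embeddings.
domain_vec = [1.0, 0.0]                    # stands in for a learned domain representation
in_domain = [0.9, 0.1]                     # embedding of a reference sample
off_domain = [[0.0, 1.0], [-1.0, 0.2]]     # embeddings of unrelated samples

loss_aligned = info_nce(domain_vec, in_domain, off_domain)
loss_misaligned = info_nce(domain_vec, off_domain[0], [in_domain])
```

In this toy setup the loss is small when the domain vector already sits close to the in-domain sample and large when the roles are swapped, which is the signal a training loop would minimize.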

Problem

Research questions and friction points this paper is trying to address.

domain-specific data synthesis

inductive paradigm

reference examples

minimal sufficient representation

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

domain-specific data synthesis

minimal sufficient representation

inductive paradigm