A Generalizable Framework for Building Executable Domain-Specific LLMs under Data Scarcity: Demonstration on Semiconductor TCAD Simulation

📅 2026-01-15
🤖 AI Summary
This work addresses the challenge of developing domain-specific large language models (LLMs) that generate both fluent natural language and executable code in data-scarce scientific and engineering domains. The authors propose a schema-first alignment framework that injects domain knowledge through large-scale synthetic question-answering pairs, applies intermediate representation (IR)-driven direct preference optimization (DPO) to improve instruction following and code executability, and includes a controlled evaluation of retrieval-augmented generation (RAG), which helps general LLMs but can marginally degrade already domain-aligned models. On the TCAD task, the resulting model, TcadGPT, achieves 85.6% semantic accuracy and an 80.0% syntax pass rate on SDE executability tests, substantially outperforming GPT-4o. Applying the same recipe to the open-source FEM solver Elmer yields consistent improvements in script-level success rates over general-purpose baselines, suggesting a reproducible and generalizable path to domain-specific LLMs.

📝 Abstract
Scientific and engineering verticals often suffer from data scarcity and strict executability requirements: models must generate not only fluent text, but also syntactically valid, tool-compilable scripts. We present a schema-first alignment framework for building compact, executable domain-specific LLMs in low-resource settings. The framework integrates three core components: (i) large-scale synthetic QA data generation from expert documentation to instill foundational domain knowledge; (ii) a code-centric IR->DPO workflow that converts verified tool decks into interpretable intermediate representations (IR), performs equivalence-preserving diversification, and constructs preference pairs to directly optimize instruction compliance and code executability; and (iii) a controlled evaluation of Retrieval-Augmented Generation (RAG), showing that while RAG benefits general LLMs, it can marginally degrade the performance of already domain-aligned models. We demonstrate the framework by instantiating TcadGPT for semiconductor Technology Computer-Aided Design (TCAD). Using 1.5M synthetic QA pairs and an IR-driven DPO dataset, TcadGPT attains 85.6% semantic accuracy and an 80.0% syntax pass rate on SDE executability tests, substantially outperforming state-of-the-art general LLMs such as GPT-4o. To probe portability beyond TCAD, we apply the same recipe to the open-source FEM solver Elmer, observing consistent improvements in script-level success rates over general-purpose baselines. All datasets, benchmarks, and code (including P1, P2, and IR->DPO) are released for reproducibility. Together, these results suggest that the proposed framework provides a robust and reproducible path toward executable LLMs in specialized, data-scarce professional domains.
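As a rough illustration of component (ii), the sketch below shows one way preference pairs could be derived from an intermediate representation. It is a minimal toy under stated assumptions, not the authors' implementation: the `Command` schema, the `name key=value` script syntax, the choice of parameter reordering as the equivalence-preserving transformation, and the choice of dropping a parameter to produce the rejected sample are all hypothetical. The `prompt`/`chosen`/`rejected` keys follow a common convention for DPO datasets.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    """One IR node: a command name plus keyword parameters (hypothetical schema)."""
    name: str
    params: tuple  # (key, value) pairs; order is ASSUMED not to affect meaning


def render(commands):
    """Serialize IR commands into a toy script syntax (not real SDE syntax)."""
    lines = []
    for cmd in commands:
        args = " ".join(f"{k}={v}" for k, v in cmd.params)
        lines.append(f"{cmd.name} {args}".strip())
    return "\n".join(lines)


def diversify(commands, rng):
    """Equivalence-preserving variant: permute parameter order within each command."""
    out = []
    for cmd in commands:
        params = list(cmd.params)
        rng.shuffle(params)
        out.append(Command(cmd.name, tuple(params)))
    return out


def corrupt(commands, rng):
    """Non-equivalent variant: drop one parameter from one command.

    Assumes at least one command in the deck has parameters.
    """
    candidates = [i for i, c in enumerate(commands) if c.params]
    idx = rng.choice(candidates)
    target = commands[idx]
    drop = rng.randrange(len(target.params))
    params = target.params[:drop] + target.params[drop + 1:]
    broken = list(commands)
    broken[idx] = Command(target.name, params)
    return broken


def build_pairs(instruction, commands, n_pairs, seed=0):
    """Build DPO-style records: equivalent variants are chosen, corrupted ones rejected."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        pairs.append({
            "prompt": instruction,
            "chosen": render(diversify(commands, rng)),
            "rejected": render(corrupt(commands, rng)),
        })
    return pairs


# Hypothetical verified deck expressed in the toy IR.
deck = [
    Command("create_region", (("material", "Silicon"), ("x0", 0), ("x1", 1))),
    Command("set_doping", (("species", "Boron"), ("conc", "1e18"))),
]
pairs = build_pairs("Create a silicon region and dope it with boron.", deck, n_pairs=3)
print(pairs[0]["chosen"])
print(pairs[0]["rejected"])
```

In a real pipeline the rejected sample would presumably be checked against the actual tool to confirm that it fails or violates the instruction; here the corruption is only assumed to break the script.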
Problem

Research questions and friction points this paper is trying to address.

data scarcity
executable LLMs
domain-specific modeling
code executability
low-resource settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

executable LLMs
data scarcity
intermediate representation (IR)
DPO
domain-specific alignment
Di Wang
Inspur Electronic Information Industry Co., Ltd, Beijing, China.
Zhenhua Wu
Center for Quantum Matter, Zhejiang University, Hangzhou, China.
Y. Liu
School of Electric Power Engineering, South China University of Technology
Power Systems
Kai Chang
Center for Quantum Matter, School of Physics, Zhejiang University, Hangzhou 310058, China
Condensed Matter Physics
Shaohua Wu
Inspur Electronic Information Industry Co., Ltd, Beijing, China.