🤖 AI Summary
This work addresses the challenge of building domain-specific large language models (LLMs) that generate both fluent natural language and executable code in data-scarce scientific and engineering domains. The authors propose a schema-first alignment framework that injects domain knowledge through large-scale synthetic question-answering pairs, applies intermediate representation (IR)-driven direct preference optimization (DPO) to improve instruction following and code executability, and includes a controlled evaluation of retrieval-augmented generation (RAG), which helps general LLMs but can marginally degrade already domain-aligned models. Instantiated as TcadGPT for semiconductor TCAD with 1.5M synthetic QA pairs, the method achieves 85.6% semantic accuracy and an 80.0% syntax pass rate on SDE executability tests, substantially outperforming general LLMs such as GPT-4o. Applying the same recipe to the open-source FEM solver Elmer yields consistent improvements in script-level success rates over general-purpose baselines, suggesting a reproducible path toward domain-specific LLMs.
📝 Abstract
Scientific and engineering verticals often suffer from data scarcity and strict executability requirements: models must generate not only fluent text, but also syntactically valid, tool-compilable scripts. We present a schema-first alignment framework for building compact, executable domain-specific LLMs in low-resource settings. The framework integrates three core components: (i) large-scale synthetic QA data generation from expert documentation to instill foundational domain knowledge; (ii) a code-centric IR->DPO workflow that converts verified tool decks into interpretable intermediate representations (IR), performs equivalence-preserving diversification, and constructs preference pairs to directly optimize instruction compliance and code executability; and (iii) a controlled evaluation of Retrieval-Augmented Generation (RAG), showing that while RAG benefits general LLMs, it can marginally degrade the performance of already domain-aligned models. We demonstrate the framework by instantiating TcadGPT for semiconductor Technology Computer-Aided Design (TCAD). Using 1.5M synthetic QA pairs and an IR-driven DPO dataset, TcadGPT attains 85.6% semantic accuracy and an 80.0% syntax pass rate on SDE executability tests, substantially outperforming state-of-the-art general LLMs such as GPT-4o. To probe portability beyond TCAD, we apply the same recipe to the open-source FEM solver Elmer, observing consistent improvements in script-level success rates over general-purpose baselines. All datasets, benchmarks, and code (including P1, P2, and IR->DPO) are released for reproducibility. Together, these results suggest that the proposed framework provides a robust and reproducible path toward executable LLMs in specialized, data-scarce professional domains.
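To make component (ii) concrete, the following is a minimal sketch of how preference pairs might be derived from an intermediate representation. It is not the authors' released pipeline: the `Statement` schema, the toy script syntax, the command names, the assumption that keyword-argument order is semantically irrelevant, and the "drop one argument" corruption are all hypothetical choices made for illustration. The output uses the `prompt`/`chosen`/`rejected` record layout that is common in DPO tooling.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One IR node (hypothetical schema): a command plus (key, value) arguments."""
    command: str
    args: tuple


def render(ir):
    """Serialize IR statements into a toy script syntax (not real tool syntax)."""
    lines = []
    for s in ir:
        rendered_args = " ".join(f"{k}={v}" for k, v in s.args)
        lines.append(f"{s.command} {rendered_args}".strip())
    return "\n".join(lines)


def diversify(ir, rng):
    """Equivalence-preserving variant: reorder keyword arguments only.

    Assumes argument order carries no meaning in the target language.
    """
    out = []
    for s in ir:
        args = list(s.args)
        rng.shuffle(args)
        out.append(Statement(s.command, tuple(args)))
    return out


def corrupt(ir, rng):
    """Rejected variant: drop one argument from one statement that has any."""
    candidates = [i for i, s in enumerate(ir) if s.args]
    if not candidates:
        raise ValueError("IR has no arguments to remove")
    idx = rng.choice(candidates)
    victim = ir[idx]
    drop = rng.randrange(len(victim.args))
    broken = Statement(victim.command, victim.args[:drop] + victim.args[drop + 1:])
    return ir[:idx] + [broken] + ir[idx + 1:]


def build_pairs(prompt, ir, n, seed=0):
    """Build n preference records from one verified IR."""
    rng = random.Random(seed)
    return [
        {
            "prompt": prompt,
            "chosen": render(diversify(ir, rng)),
            "rejected": render(corrupt(ir, rng)),
        }
        for _ in range(n)
    ]


# Toy IR with made-up command and argument names.
ir = [
    Statement("create_rectangle", (("material", "Silicon"), ("width", "1.0"), ("height", "0.5"))),
    Statement("set_doping", (("species", "Boron"), ("concentration", "1e17"))),
]
pairs = build_pairs("Create a doped silicon rectangle.", ir, n=3)
print(pairs[0]["chosen"])
print(pairs[0]["rejected"])
```

In a real setting, both variants would presumably be checked against the target tool, so that every chosen script is known to compile and every rejected script is known to violate the instruction or fail to compile; the string-level corruption above only stands in for that verification step.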