🤖 AI Summary
This study addresses the challenge of generating or modifying industrial-scale, multi-file domain-specific language (DSL) code from natural-language instructions using large language models. The work proposes an end-to-end approach that encodes Xtext-based DSL repositories into a path-preserving JSON format, enabling coherent cross-file edits within a single model response. Key contributions include a demonstration of single-instruction, multi-file DSL generation in an industrial setting, a structure-preserving JSON representation, and task-oriented evaluation metrics. Qwen2.5-Coder and DeepSeek-Coder (7B) are evaluated under baseline prompting, one-shot in-context learning, and QLoRA fine-tuning. Fine-tuning performs best, achieving perfect structural fidelity (1.00), high exact-match accuracy, and strong edit similarity on the held-out test set, while one-shot in-context learning gives smaller gains over the baseline. Practical utility is further assessed through an expert developer survey and an execution-based check using the existing code generator.
📝 Abstract
Large language models (LLMs) perform strongly on general-purpose code generation, yet their applicability to enterprise domain-specific languages (DSLs) remains underexplored, especially for repository-scale change generation spanning multiple files and folder structures from a single natural-language (NL) instruction. We report an industrial case study at BMW that adapts code-oriented LLMs to generate and modify project-root DSL artifacts for an Xtext-based DSL that drives downstream Java/TypeScript code generation. We develop an end-to-end pipeline for dataset construction, multi-file task representation, model adaptation, and evaluation. We encode DSL folder hierarchies as structured, path-preserving JSON, which allows a model to generate a repository-scale change in a single response and to learn cross-file dependencies. We evaluate two instruction-tuned code LLMs (Qwen2.5-Coder and DeepSeek-Coder, 7B) under three configurations: baseline prompting, one-shot in-context learning, and parameter-efficient fine-tuning (QLoRA). Beyond standard similarity metrics, we introduce task-specific measures that assess edit correctness and repository structural fidelity. Fine-tuning yields the largest gains across models and metrics, achieving high exact-match accuracy, substantial edit similarity, and structural fidelity of 1.00 on our held-out set for multi-file outputs. One-shot in-context learning provides smaller but consistent improvements over baseline prompting. We further validate practical utility via an expert developer survey and an execution-based check using the existing code generator.
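To make the path-preserving JSON idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the representation is a flat mapping from repository-relative file paths to file contents, and it assumes a path-overlap (Jaccard) reading of "structural fidelity"; the abstract does not define either. The function names `encode_repo` and `structural_fidelity` and the `.dsl` file suffix are hypothetical.

```python
import json
from pathlib import Path


def encode_repo(root: Path, suffix: str = ".dsl") -> str:
    """Serialize files under `root` as a {relative/path: content} JSON object.

    The ".dsl" suffix is a placeholder; the real file extension is an assumption.
    """
    files = {
        # POSIX-style relative paths keep the folder hierarchy in the keys.
        p.relative_to(root).as_posix(): p.read_text(encoding="utf-8")
        for p in sorted(root.rglob(f"*{suffix}"))
        if p.is_file()
    }
    return json.dumps(files, indent=2, sort_keys=True)


def structural_fidelity(predicted_json: str, reference_json: str) -> float:
    """Jaccard overlap of file paths in prediction vs. reference.

    This definition is an assumption made for illustration; a score of 1.0
    means both repositories contain exactly the same set of file paths.
    """
    try:
        predicted = json.loads(predicted_json)
    except json.JSONDecodeError:
        return 0.0  # Output that is not valid JSON preserves no structure.
    if not isinstance(predicted, dict):
        return 0.0  # Valid JSON, but not a path-to-content mapping.
    reference = json.loads(reference_json)
    pred_paths, ref_paths = set(predicted), set(reference)
    union = pred_paths | ref_paths
    return len(pred_paths & ref_paths) / len(union) if union else 1.0
```

Under this reading, a model response is scored by parsing it as JSON and comparing its keys with those of the reference repository; file contents would be compared separately by exact match or edit similarity.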