🤖 AI Summary
This study addresses the challenge of preserving human-centric semantic information, such as comments and layout, during the co-evolution of domain-specific language (DSL) grammars and their instances, a task where traditional model-driven approaches often fall short. For the first time, it systematically evaluates the effectiveness of large language models (LLMs) in this context, conducting multi-round experiments on ten DSLs using Claude Sonnet 4.5 and GPT-5.2. The investigation examines the impact of evolution complexity, deletion granularity, and instance size. Results show that for small-scale changes (fewer than 20 modified lines), the LLMs achieve precision and recall of at least 94%. Claude maintains 85% recall even at 40 lines, whereas GPT degrades significantly on larger instances, and its response time rises sharply as instance size grows.
📝 Abstract
Software languages evolve over time, for example through feature additions. When a grammar evolves, textual instances that originally conformed to it may become outdated. While model-driven engineering provides many techniques for co-evolving models with metamodel changes, these approaches are not designed for textual DSLs and may lose human-relevant information such as layout and comments. This study systematically evaluates the potential of large language models (LLMs) for co-evolving grammars and instances of textual DSLs. Using Claude Sonnet 4.5 and GPT-5.2 across ten case languages with ten runs each, we assess both correctness and preservation of human-oriented information. Results show strong performance on small-scale cases ($\geq$94% precision and recall for instances requiring fewer than 20 modified lines), but performance degrades with scale: Claude maintains 85% recall at 40 lines, while GPT fails on the largest instances. Response time increases substantially with instance size, and grammar evolution complexity and deletion granularity affect performance more than change type. These findings clarify when LLM-based co-evolution is effective and where current limitations remain.