(How) Do Large Language Models Understand High-Level Message Sequence Charts?

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study presents the first systematic evaluation of whether large language models (LLMs) comprehend the formal semantics of High-Level Message Sequence Charts (HMSCs), with a focus on semantic consistency in automated software architecture design. Using a benchmark of 129 structured semantic tasks, the authors assess Gemini-3, GPT-5.4, and Qwen-3.6 on reasoning about event ordering, abstraction and composition, trace sets, and labelled transition systems (LTSs). Overall accuracy is approximately 52%: performance is strong on basic semantic concepts (ca. 88%) but weak on abstraction and composition (ca. 36%) and on trace and LTS inference (ca. 42%). All three models also struggle with co-regions and explicit causal dependencies, and never use them in semantics-preserving transformations. These results give empirical evidence of where current LLMs fall short in formal modelling tasks.

📝 Abstract

Large Language Models (LLMs) are being employed widely to automate tasks across the software development life-cycle. It is, however, unclear whether these tasks are performed consistently with respect to the semantics of the artefacts being handled. This question is particularly under-researched concerning architectural design specification. In this paper, we address this question for High-Level Message Sequence Charts (HMSCs). These are visual models with a rigorous formal semantics that have been used for various purposes, including as a foundation for Sequence Diagrams in the Unified Modelling Language (UML). We examine whether LLMs "understand" the semantics of HMSCs by examining three LLMs (Gemini-3, GPT-5.4, and Qwen-3.6) on how they perform 129 semantic tasks ranging from querying basic semantic constructs in HMSCs (i.e., events and their ordering) to semantic-preserving abstractions and compositions, and calculating the set of traces and trace-equivalent labelled transition systems. The results show that LLMs only have a modest understanding of the formal semantics of HMSCs (ca. 52% overall accuracy), with great variability across different semantic concepts: while LLMs seem to understand the basic semantic concepts of MSCs (ca. 88% accuracy), they struggle with semantic reasoning in tasks involving abstraction and composition (ca. 36% accuracy) and traces and LTSs (ca. 42% accuracy). In particular, all three LLMs struggle with the notions of co-region and explicit causal dependencies and never employed them in semantic-preserving transformations.
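To make the "set of traces" task concrete, here is a minimal sketch of how the traces of a single basic MSC could be enumerated. It is an illustration only, not the paper's benchmark or tooling. The function name `msc_traces`, the event labels (`!m` for sending and `?m` for receiving message `m`), and the example chart are all assumptions introduced here. The sketch assumes unique event labels, that events on one instance are totally ordered top to bottom, and that each send precedes its matching receive. It ignores co-regions, explicit causal dependencies, and HMSC-level composition.

```python
from itertools import permutations


def msc_traces(instances, messages):
    """Enumerate the traces of a basic MSC by brute force.

    instances: dict mapping an instance name to its event labels,
               listed in top-to-bottom order (labels must be unique).
    messages:  list of (send_event, receive_event) pairs.

    A trace is any total order of all events that respects
    (1) the order of events along each instance and
    (2) send-before-receive for every message.
    """
    events = [e for evs in instances.values() for e in evs]

    # Generating pairs of the partial order; checking these is enough,
    # because any total order that respects them is transitive anyway.
    before = set(messages)
    for evs in instances.values():
        before.update(zip(evs, evs[1:]))

    traces = []
    for perm in permutations(events):
        pos = {e: i for i, e in enumerate(perm)}
        if all(pos[a] < pos[b] for a, b in before):
            traces.append(perm)
    return sorted(traces)


# Hypothetical chart: A sends m1 to B, then sends m2 to C.
example = msc_traces(
    instances={"A": ["!m1", "!m2"], "B": ["?m1"], "C": ["?m2"]},
    messages=[("!m1", "?m1"), ("!m2", "?m2")],
)
for trace in example:
    print(" ".join(trace))
```

For this chart the receipt of `m1` on B is unordered with respect to both events of `m2`, so the chart admits three traces rather than one. Brute-force enumeration over permutations is only practical for very small charts; it is used here because it mirrors the definition directly.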

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

High-Level Message Sequence Charts

formal semantics

semantic understanding

software architecture

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models

High-Level Message Sequence Charts

formal semantics