MermaidSeqBench: An Evaluation Benchmark for LLM-to-Mermaid Sequence Diagram Generation

📅 2025-11-18

🤖 AI Summary

Problem: Existing research lacks a systematic benchmark for evaluating whether large language models (LLMs) can generate syntactically and semantically correct Mermaid sequence diagrams from natural language specifications. Method: We introduce the first dedicated benchmark for this task, comprising a human-validated core dataset and an LLM-synthesized extension set. We propose an LLM-as-a-judge evaluation framework that runs multiple judge models and scores outputs on fine-grained criteria, including syntax correctness, activation handling, and error handling. Dataset construction integrates human annotation, in-context learning prompts, rule-based mutation, and LLM-driven synthesis. Contribution/Results: Empirical evaluation across multiple state-of-the-art LLMs reveals substantial performance disparities between models, which indicates that the benchmark can discriminate between them. This work fills a gap in the evaluation of structured diagram generation, a previously underexplored area of LLM assessment.

📝 Abstract

Large language models (LLMs) have demonstrated excellent capabilities in generating structured diagrams from natural language descriptions. In particular, they have shown great promise in generating sequence diagrams for software engineering, typically represented in a text-based syntax such as Mermaid. However, systematic evaluation in this space remains underdeveloped because no existing benchmark assesses an LLM's correctness on this task. To address this shortcoming, we introduce MermaidSeqBench, a human-verified and LLM-synthetically-extended benchmark for assessing an LLM's capabilities in generating Mermaid sequence diagrams from textual prompts. The benchmark consists of a core set of 132 samples, starting from a small set of manually crafted and verified flows. These were expanded via a hybrid methodology combining human annotation, in-context LLM prompting, and rule-based variation generation. Our benchmark uses an LLM-as-a-judge approach to assess Mermaid sequence diagram generation across fine-grained metrics, including syntax correctness, activation handling, error handling, and practical usability. We perform initial evaluations on numerous state-of-the-art LLMs and use multiple LLM judge models to demonstrate the effectiveness and flexibility of our benchmark. Our results reveal significant capability gaps across models and evaluation modes. Our proposed benchmark provides a foundation for advancing research in structured diagram generation and for developing more rigorous, fine-grained evaluation methodologies.
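
To make the multi-judge evaluation concrete, here is a minimal sketch of how per-metric scores from several judges could be collected and averaged. It is not the paper's implementation: the metric names come from the abstract, but the 0 to 1 scale, the JSON reply format, the prompt wording, and the `Judge` callable (prompt in, raw reply out) are all assumptions made for illustration. Stub judges replace real model calls so the sketch runs offline.

```python
import json
from statistics import mean
from typing import Callable

# Metric names follow the abstract; the 0-1 scale is an assumption.
METRICS = ["syntax_correctness", "activation_handling",
           "error_handling", "practical_usability"]

Judge = Callable[[str], str]  # hypothetical interface: prompt in, JSON reply out


def build_prompt(task: str, candidate: str) -> str:
    """Ask a judge for one score per metric, returned as a JSON object."""
    keys = ", ".join(METRICS)
    return (
        "Score the Mermaid sequence diagram against the task.\n"
        f"Return a JSON object with keys {keys}, each a number from 0 to 1.\n\n"
        f"Task:\n{task}\n\nDiagram:\n{candidate}\n"
    )


def parse_scores(reply: str) -> dict[str, float]:
    """Parse a judge reply; raises KeyError if a metric is missing."""
    data = json.loads(reply)
    return {m: float(data[m]) for m in METRICS}


def judge_candidate(judges: dict[str, Judge], task: str,
                    candidate: str) -> dict[str, float]:
    """Send the same prompt to every judge and average each metric."""
    prompt = build_prompt(task, candidate)
    per_judge = [parse_scores(judge(prompt)) for judge in judges.values()]
    return {m: mean(s[m] for s in per_judge) for m in METRICS}


# Stub judges stand in for real model calls.
strict = lambda _prompt: json.dumps(dict.fromkeys(METRICS, 0.5))
lenient = lambda _prompt: json.dumps(dict.fromkeys(METRICS, 1.0))

scores = judge_candidate(
    {"strict": strict, "lenient": lenient},
    task="User logs in; the API returns 401 on bad credentials.",
    candidate="sequenceDiagram\n    User->>API: POST /login\n",
)
print(scores)
```

Keeping the judges behind a plain callable means any model client can be swapped in, and reporting per-judge scores next to the average would show how much the judges disagree.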

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM correctness in generating Mermaid sequence diagrams from text

Addressing the lack of systematic benchmarks for structured diagram generation

Assessing fine-grained metrics like syntax and error handling in diagrams
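
As a rough illustration of what "syntax" and "activation handling" can mean for a Mermaid sequence diagram, the sketch below runs two structural checks: the `sequenceDiagram` header is present, and every activation is eventually closed. This is a hand-rolled approximation, not the benchmark's judge and not a Mermaid parser. It only recognizes common arrow forms, the `+`/`-` activation shorthand, and single-word participant names; the function name and messages are hypothetical.

```python
import re

# Message line: sender, arrow, optional +/- activation shorthand, receiver.
MESSAGE = re.compile(r"^(\w+)\s*--?(?:>>|>|x|\))\s*([+-]?)\s*(\w+)\s*:")
EXPLICIT = re.compile(r"^(activate|deactivate)\s+(\w+)$")

SAMPLE = """sequenceDiagram
    participant User
    participant API
    User->>+API: POST /login
    alt valid credentials
        API-->>User: 200 OK
    else invalid credentials
        API-->>User: 401 Unauthorized
    end
    deactivate API
"""


def check_diagram(diagram: str) -> list[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems: list[str] = []
    open_count: dict[str, int] = {}

    def activate(name: str) -> None:
        open_count[name] = open_count.get(name, 0) + 1

    def deactivate(name: str) -> None:
        if open_count.get(name, 0) == 0:
            problems.append(f"{name} deactivated while inactive")
        else:
            open_count[name] -= 1

    lines = [ln.strip() for ln in diagram.splitlines() if ln.strip()]
    has_header = bool(lines) and lines[0] == "sequenceDiagram"
    if not has_header:
        problems.append("missing 'sequenceDiagram' header")

    for ln in lines[1:] if has_header else lines:
        msg = MESSAGE.match(ln)
        explicit = EXPLICIT.match(ln)
        if msg:
            sender, sign, receiver = msg.groups()
            if sign == "+":      # '+' activates the receiver
                activate(receiver)
            elif sign == "-":    # '-' deactivates the sender
                deactivate(sender)
        elif explicit:
            action, name = explicit.groups()
            activate(name) if action == "activate" else deactivate(name)

    for name, count in open_count.items():
        if count > 0:
            problems.append(f"{name} activated but never deactivated")
    return problems


print(check_diagram(SAMPLE))
```

A deterministic check like this can only catch structural slips. Judging whether the `alt`/`else` branches actually model the error path described in the prompt still needs a semantic evaluator, which is the role the paper assigns to the LLM judges.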

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid human annotation and LLM prompting

Rule-based variation generation for data expansion

LLM-as-a-judge model for fine-grained evaluation
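
The summary does not say which rules the authors used for variation generation. As one plausible example, the sketch below applies a participant-renaming rule to a seed sample, rewriting the natural-language prompt and the diagram with the same mapping so the pair stays consistent. The function names and the seed sample are hypothetical.

```python
import re


def rename_participants(text: str, mapping: dict[str, str]) -> str:
    """Rename whole-word occurrences in a single pass.

    A single pass keeps swaps such as User <-> API from chaining.
    Note that names inside message labels are renamed too.
    """
    if not mapping:
        return text
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(0)], text)


def make_variants(prompt: str, diagram: str,
                  mappings: list[dict[str, str]]) -> list[tuple[str, str]]:
    """Produce one (prompt, diagram) pair per renaming rule."""
    return [
        (rename_participants(prompt, m), rename_participants(diagram, m))
        for m in mappings
    ]


seed_prompt = "The User sends a login request to the API."
seed_diagram = (
    "sequenceDiagram\n"
    "    User->>API: login\n"
    "    API-->>User: ok\n"
)

variants = make_variants(
    seed_prompt,
    seed_diagram,
    [{"User": "Customer", "API": "Gateway"}],
)
for new_prompt, new_diagram in variants:
    print(new_prompt)
    print(new_diagram)
```

Renaming changes surface form without changing the interaction structure, so a variant produced this way tests robustness to naming rather than a new flow. Rules that alter structure, such as adding an error branch, would need the expected diagram to be updated and re-verified.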