🤖 AI Summary
Current vision-language models (VLMs) handle structured, controllable diagram generation and editing poorly: pixel-based outputs are hard to edit and low in fidelity. This work proposes the Diagram-as-Code paradigm, which brings symbolic logic into diagram synthesis and enables precise, controllable creation and modification of visual content through mxGraph XML representations. To evaluate these capabilities systematically, the authors construct VCG-Bench, a unified vision-centric benchmark covering both generation and editing tasks, and introduce multidimensional metrics including Execution Success Rate and Style Consistency Score. Experimental results show that state-of-the-art models still fall short in structural fidelity and instruction following, exposing their limitations in structured visual understanding and reasoning.
📝 Abstract
Despite rapid advances in Vision-Language Models (VLMs), a critical gap remains in their ability to handle the structured, controllable diagrammatic tasks essential to professional workflows. Existing methods rely predominantly on pixel-based synthesis, which operates in probabilistic pixel spaces and is inherently limited in editability and fidelity. We instead propose a new Diagram-as-Code paradigm with symbolic logic that uses mxGraph Extensible Markup Language (XML) for precise diagram generation and editing. We present VCG-Bench, a unified benchmark for visual-centric mxGraph tasks. VCG-Bench comprises: (1) a taxonomized dataset of 1,449 diverse diagrams spanning 6 domains and 15 sub-domains; (2) a paradigm definition that integrates Generation (Vision-to-Code) and Editability (Code-to-Code); and (3) a tailored evaluation protocol employing multi-dimensional metrics such as mxGraph Execution Success Rate and Style Consistency Score (SCS). Experimental results show that current State-of-the-Art (SOTA) VLMs still struggle with structural fidelity and instruction compliance, exposing limits in their visual understanding and reasoning.
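To make the Diagram-as-Code idea concrete, the following is a minimal sketch of a Code-to-Code edit on an mxGraph XML document, written with Python's standard library. The sample diagram, the helper names (`parse_style`, `format_style`, `edit_cell_style`, `looks_well_formed`), and the well-formedness check are all illustrative assumptions. In particular, `looks_well_formed` is only a rough stand-in for the idea behind an execution-success check; it is not VCG-Bench's definition of Execution Success Rate or SCS, which the abstract does not specify.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-node diagram in mxGraph XML (not taken from VCG-Bench).
DIAGRAM = """<mxGraphModel>
  <root>
    <mxCell id="0"/>
    <mxCell id="1" parent="0"/>
    <mxCell id="2" value="Start" style="rounded=1;fillColor=#dae8fc;" vertex="1" parent="1">
      <mxGeometry x="40" y="40" width="120" height="60" as="geometry"/>
    </mxCell>
    <mxCell id="3" value="End" style="rounded=1;" vertex="1" parent="1">
      <mxGeometry x="240" y="40" width="120" height="60" as="geometry"/>
    </mxCell>
    <mxCell id="4" style="edgeStyle=orthogonalEdgeStyle;" edge="1" source="2" target="3" parent="1">
      <mxGeometry relative="1" as="geometry"/>
    </mxCell>
  </root>
</mxGraphModel>"""


def parse_style(style: str) -> dict:
    """Split an mxGraph style string such as 'k=v;k=v;' into a dict."""
    pairs = {}
    for token in style.split(";"):
        if not token:
            continue
        key, _, value = token.partition("=")
        pairs[key] = value  # bare tokens (e.g. 'ellipse') map to ''
    return pairs


def format_style(pairs: dict) -> str:
    """Inverse of parse_style: rebuild the 'k=v;' string."""
    return "".join(f"{k}={v};" if v else f"{k};" for k, v in pairs.items())


def edit_cell_style(xml_text: str, cell_id: str, updates: dict) -> str:
    """Code-to-Code edit: update style keys on one cell, leave the rest intact."""
    model = ET.fromstring(xml_text)
    for cell in model.iter("mxCell"):
        if cell.get("id") == cell_id:
            style = parse_style(cell.get("style", ""))
            style.update(updates)
            cell.set("style", format_style(style))
            break
    else:
        raise KeyError(f"no mxCell with id={cell_id!r}")
    return ET.tostring(model, encoding="unicode")


def looks_well_formed(xml_text: str) -> bool:
    """Assumed proxy check: XML parses, root is mxGraphModel, edges point at real cells."""
    try:
        model = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    if model.tag != "mxGraphModel":
        return False
    ids = {cell.get("id") for cell in model.iter("mxCell")}
    for cell in model.iter("mxCell"):
        if cell.get("edge") == "1":
            for end in ("source", "target"):
                ref = cell.get(end)
                if ref is not None and ref not in ids:
                    return False
    return True


if __name__ == "__main__":
    edited = edit_cell_style(DIAGRAM, "2", {"fillColor": "#ffcccc"})
    print(looks_well_formed(edited))
```

Because the edit touches a single attribute on a single cell, geometry, labels, and connectivity are preserved exactly, which is the kind of precise, local modification that pixel-based synthesis cannot guarantee.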