An Online Reference-Free Evaluation Framework for Flowchart Image-to-Code Generation

📅 2026-02-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of evaluating diagram-to-code generation systems in production settings, where reference code is unavailable. The authors propose a reference-free online evaluation framework that monitors generation quality at inference time using only the input diagram image and the generated code. The approach combines OCR-based recall with visual entailment (VE) precision into a unified metric, F1_OCR-VE: text extracted from the image by OCR serves as a proxy reference for content coverage, while visual entailment against the original image detects hallucinated elements in the generated code. On the FlowVQA dataset, the reference-free metrics agree closely with their ground-truth counterparts, with average Pearson correlation coefficients of 0.97, 0.91, and 0.94 for recall, precision, and F1, respectively.

📝 Abstract
Vision-Language Models (VLMs) are increasingly used in document processing pipelines to convert flowchart images into structured code (e.g., Mermaid). In production, these systems process arbitrary inputs for which no ground-truth code exists, making output quality difficult to assess. We propose a reference-free evaluation framework that monitors flowchart image-to-code generation quality at inference time, using only the input image and the generated output. The framework introduces two automated metrics: $\text{Recall}_{\text{OCR}}$, which estimates content coverage by extracting text from the input image via OCR as a proxy reference, and $\text{Precision}_{\text{VE}}$, which detects hallucinated elements through Visual Entailment against the original image. Their harmonic mean, $\text{F1}_{\text{OCR-VE}}$, provides a unified quality score. Validation on the FlowVQA dataset shows strong agreement with ground-truth metrics (average Pearson's $r = 0.97$, $0.91$, and $0.94$ for Recall, Precision, and F1, respectively), confirming the framework's reliability as a practical, reference-free alternative for continuous quality monitoring in production settings.
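As a rough illustration of how the three scores fit together, the sketch below computes an OCR-style recall, a VE-style precision, and their harmonic mean. Only the harmonic-mean combination is taken from the abstract. The matching rule (case-insensitive substring search), the per-element boolean entailment flags, the zero-value conventions for empty inputs, and all function names are assumptions made for this example, not the authors' implementation. The OCR step and the visual entailment model are not shown; their outputs are passed in as plain Python values.

```python
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores layout."""
    return " ".join(text.lower().split())


def recall_ocr(ocr_texts: Iterable[str], generated_code: str) -> float:
    """Fraction of OCR-extracted text items that appear in the generated code.

    Assumption: an item counts as covered if its normalized form is a
    substring of the normalized code. Substring matching is crude and
    can produce false positives for very short strings.
    """
    code = normalize(generated_code)
    items = [normalize(t) for t in ocr_texts if t.strip()]
    if not items:
        return 0.0  # assumed convention when OCR finds no text
    hits = sum(1 for t in items if t in code)
    return hits / len(items)


def precision_ve(entailed_flags: Iterable[bool]) -> float:
    """Fraction of generated elements judged as entailed by the image.

    Assumption: a separate visual entailment model has already produced
    one boolean per generated element (True = supported by the image).
    """
    flags = list(entailed_flags)
    if not flags:
        return 0.0  # assumed convention when nothing was generated
    return sum(flags) / len(flags)


def f1_ocr_ve(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision (0.0 if both are zero)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Hypothetical example: OCR found three labels, the code contains two.
mermaid = "graph TD\n  A[Start] --> B{Check input}\n"
r = recall_ocr(["Start", "Check input", "End"], mermaid)
# Hypothetical VE verdicts for four generated elements.
p = precision_ve([True, True, False, True])
score = f1_ocr_ve(r, p)
```

In this hypothetical case two of three OCR labels are covered and three of four generated elements are entailed, so the combined score sits between the two component values, closer to the lower one, as expected of a harmonic mean.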
Problem

Research questions and friction points this paper is trying to address.

reference-free evaluation
flowchart image-to-code generation
output quality assessment
vision-language models
document processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

reference-free evaluation
flowchart-to-code generation
OCR-based recall
visual entailment
Vision-Language Models
Giang Son Nguyen
Nanyang Technological University
Zi Pong Lim
Nanyang Technological University
Sarthak Ketanbhai Modi
Nanyang Technological University
Yon Shin Teo
AUMOVIO Singapore
Wenya Wang
Nanyang Technological University
Deep Learning, Knowledge Reasoning, Natural Language Processing, Sentiment Analysis