Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks for image-to-code generation cover narrow visual domains, depend on paired reference code, or use generic metrics that miss domain-specific errors, which limits how thoroughly models can be evaluated. This work proposes Vision2Code, a reference-code-free, multi-domain benchmark and evaluation framework with 2,169 test examples drawn from 15 source datasets, including charts, geometric figures, and 3D scenes. Model-generated programs are executed and rendered, then scored against the source image by a VLM rater using dataset-specific rubrics, with deterministic guardrails for severe semantic failures. Experiments across nine open-weight and proprietary models show that performance is strongly domain-dependent, and human validation shows that the protocol aligns better with human judgments than a generic visual rubric or embedding-similarity baselines. The authors also show that evaluator-filtered model outputs can serve as training data: Qwen3.5-9B improves from 1.60 to 1.86 on the benchmark without paired source programs.
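
To make the evaluator-filtering idea concrete, here is a minimal sketch of how scored model outputs might be turned into training pairs. It is an illustration only, not the authors' procedure: the `Candidate` record, the `select_training_pairs` helper, the `min_score` threshold, and the choice to keep one best program per image are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One generated program for one source image (hypothetical record)."""
    image_id: str
    code: str
    rendered: bool   # whether the program executed and produced an image
    score: float     # evaluator score for the rendered output


def select_training_pairs(candidates, min_score):
    """Keep the best-scoring rendered program per image, if it clears min_score.

    Assumption: the filtering rule here (threshold plus best-per-image) is
    illustrative; the paper's actual selection criterion is not stated in
    this summary.
    """
    best = {}
    for c in candidates:
        # Discard programs that failed to render or scored below the threshold.
        if not c.rendered or c.score < min_score:
            continue
        prev = best.get(c.image_id)
        if prev is None or c.score > prev.score:
            best[c.image_id] = c
    # Each pair is (image identifier, program text); no reference code is needed.
    return [(c.image_id, c.code) for c in best.values()]
```

The point of the sketch is that the training signal comes only from the evaluator's score on rendered outputs, which is why no paired source programs are required.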

📝 Abstract

Image-to-code generation tests whether a vision-language model (VLM) can recover the structure of an image well enough to express it as executable code. Existing benchmarks focus on narrow visual domains, depend on paired executable reference code, or rely on generic rubrics that miss domain-specific reconstruction errors. We introduce Vision2Code, a reference-code-free benchmark and evaluation framework for multi-domain image-to-code generation. Vision2Code contains 2,169 test examples from 15 source datasets that span charts and plots, geometry, graphs, scientific imagery, documents, and 3D spatial scenes. Models generate executable programs, which we render and score against the source image using a VLM rater with dataset-specific rubrics and deterministic guardrails for severe semantic failures. We report render-success diagnostics that separate code execution failures from reconstruction quality. Human validation shows that this evaluation protocol aligns better with human judgments than either a generic visual rubric or embedding-similarity baselines. Across nine open-weight and proprietary models, we find that image-to-code performance is domain-dependent: leading models perform well on regular chart- and graph-like visuals but remain weak on spatial scenes, chemistry, documents, and circuit-style diagrams. Finally, we show that evaluator-filtered model outputs can serve as training data to improve image-to-code capability, with Qwen3.5-9B improving from 1.60 to 1.86 on the benchmark without paired source programs. Vision2Code provides a reproducible testbed for measuring, diagnosing, and improving image-to-code generation. Our code and data are publicly available at https://image2code.github.io/vision2code/.
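
The separation of code execution failures from reconstruction quality can be illustrated with a small aggregation sketch. Everything named here is hypothetical: the `EvalRecord` fields, the `FLOOR` value, and the rule that a guardrail hit or a failed render maps to the floor score are assumptions made for illustration, since the abstract does not specify the score scale or how guardrails adjust scores.

```python
from dataclasses import dataclass
from typing import Optional

FLOOR = 0.0  # assumed minimum score; the benchmark's real scale is not given here


@dataclass
class EvalRecord:
    """Evaluation outcome for one test example (hypothetical record)."""
    dataset: str
    rendered: bool                 # did the generated program run and render?
    rater_score: Optional[float]   # rubric score from the rater, None if no render
    severe_failure: bool = False   # deterministic guardrail flag


def final_score(r):
    """Apply the assumed rule: failed renders and guardrail hits get the floor."""
    if not r.rendered or r.rater_score is None:
        return FLOOR
    if r.severe_failure:
        return FLOOR
    return r.rater_score


def summarize(records):
    """Report render success separately from quality on rendered outputs."""
    n = len(records)
    rendered = [r for r in records if r.rendered]
    return {
        "render_success_rate": len(rendered) / n if n else 0.0,
        "mean_score_all": sum(final_score(r) for r in records) / n if n else 0.0,
        "mean_score_rendered": (
            sum(final_score(r) for r in rendered) / len(rendered) if rendered else 0.0
        ),
    }
```

Reporting `mean_score_rendered` next to `render_success_rate` shows whether a low overall score comes from programs that fail to run or from programs that run but reconstruct the image poorly.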

Problem

Research questions and friction points this paper is trying to address.

image-to-code generation

vision-language models

multi-domain benchmark

evaluation framework

executable code

Innovation

Methods, ideas, or system contributions that make the work stand out.

image-to-code generation

reference-free evaluation

multi-domain benchmark