VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior AI code generation research focuses predominantly on language-centric tasks (e.g., program synthesis), neglecting vision-centric coding. Method: We introduce VCode—the first multimodal coding benchmark using executable SVG code as a symbolic visual representation—spanning commonsense reasoning, domain-specific knowledge, and visual perception. To evaluate and advance vision-aware code generation, we propose the CodeVQA assessment protocol and the VCoder framework, which integrates multiple visual tools (e.g., object detection, shape parsing) and employs iterative “reflection-and-refinement” alongside explicit “visual tool invocation” to enhance symbolic fidelity. Contribution/Results: Experiments show VCoder outperforms Claude-4-Opus by +12.3 percentage points across domains, significantly narrowing the modality gap between language and vision coding capabilities. Human evaluation confirms SVG’s effectiveness and promise as an interpretable, executable visual representation.

📝 Abstract
Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains: general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs; yet their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.
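The CodeVQA protocol described in the abstract can be sketched as a simple scoring loop: render the generated SVG, let a policy model answer questions over the rendering, and treat QA accuracy as a proxy for symbolic fidelity. The sketch below is a minimal illustration, not the paper's implementation; `render_svg` and `policy_model` are hypothetical callables standing in for an SVG rasterizer and the question-answering VLM.

```python
def codevqa_score(svg_code, qa_pairs, render_svg, policy_model):
    """Estimate symbolic fidelity of an SVG (CodeVQA-style sketch).

    svg_code:     SVG source produced by the model under evaluation
    qa_pairs:     list of (question, gold_answer) tuples about the source image
    render_svg:   callable that rasterizes SVG source into an image
    policy_model: callable (image, question) -> answer string
    Returns QA accuracy in [0, 1]; higher means more symbolic
    meaning survived the image -> SVG conversion.
    """
    if not qa_pairs:
        return 0.0
    image = render_svg(svg_code)  # e.g., rasterize with a library such as cairosvg
    correct = 0
    for question, gold_answer in qa_pairs:
        prediction = policy_model(image, question)  # hypothetical VLM call
        # Simple normalized exact match; real protocols may use an LLM judge.
        if prediction.strip().lower() == gold_answer.strip().lower():
            correct += 1
    return correct / len(qa_pairs)
```

With stub callables this runs end-to-end; in practice the renderer and policy model would be real components, and the per-question matching rule is a design choice.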
Problem

Research questions and friction points this paper is trying to address.

Addresses the underexplored area of visual-centric coding beyond language tasks
Proposes SVG as symbolic visual representation for multimodal understanding
Bridges the gap between language-centric and visual-centric coding capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses SVG code as symbolic visual representation
Introduces agentic framework with iterative revision process
Augments VLMs with visual tools for structured cues
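The revision loop behind these ideas can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: `generate_svg`, `render_svg`, and `critique` are hypothetical callables, where the critic compares the rendering against the source image (and, in VCoder, visual tools such as detectors and parsers would feed structured cues into the generation prompt).

```python
def vcoder_refine(image, generate_svg, render_svg, critique, max_rounds=3):
    """Iterative thinking-with-revision loop (sketch of the idea).

    generate_svg: callable (image, feedback_or_None) -> SVG source
    render_svg:   callable rasterizing SVG source into an image
    critique:     callable (image, rendering) -> list of discrepancies
                  (empty list means the rendering is accepted)
    """
    feedback = None
    svg = generate_svg(image, feedback)  # initial draft without feedback
    for _ in range(max_rounds):
        rendering = render_svg(svg)
        feedback = critique(image, rendering)  # e.g., missing objects, wrong shapes
        if not feedback:  # no discrepancies found: accept current SVG
            break
        svg = generate_svg(image, feedback)  # regenerate with critic feedback
    return svg
```

The key design choice is that the critic's output is structured text fed back into the generator, so each round can target concrete discrepancies rather than regenerating blindly.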
Kevin Qinghong Lin
University of Oxford; National U. of Singapore
Vision and Language, Video Understanding, AI Agent
Yuhao Zheng
University of Science and Technology of China
Hangyu Ran
Central South University
Dantong Zhu
Central South University
Dongxing Mao
Central South University
Linjie Li
Microsoft
Vision and Language
Philip H. S. Torr
University of Oxford
Alex Jinpeng Wang
Central South University