🤖 AI Summary
This work addresses the challenge of editing and scaling technical illustrations when only raster images (e.g., PNG or JPEG) survive and the original vector source files are lost. We propose VFIG, a family of vision–language models for high-fidelity automatic vectorization of complex technical diagrams into SVG. The approach combines supervised fine-tuning, which teaches the model atomic graphical primitives, with reinforcement learning, which optimizes global layout and topological structure, in a coarse-to-fine staged training curriculum. To support this effort, we introduce VFIG-DATA, a large-scale dataset of 66K high-quality SVG–image pairs, together with VFIG-BENCH, a dedicated evaluation benchmark. Experiments show our model achieves a VLM-Judge score of 0.829 on VFIG-BENCH, matching the performance of GPT-5.2 and establishing a new state of the art among open-source solutions.
📝 Abstract
Scalable Vector Graphics (SVG) are an essential format for technical illustration and digital design, offering resolution independence and semantic editability. In practice, however, original vector source files are frequently lost or inaccessible, leaving only "flat" rasterized versions (e.g., PNG or JPEG) that are difficult to modify or scale. Manually reconstructing these figures is prohibitively labor-intensive, requiring specialized expertise to recover the original geometric intent. To bridge this gap, we propose VFIG, a family of Vision-Language Models trained for high-fidelity conversion of complex figures to SVG. While this task is inherently data-driven, existing datasets are typically small in scale and lack the complexity of professional diagrams. We address this by introducing VFIG-DATA, a large-scale dataset of 66K high-quality figure-SVG pairs, curated from a diverse mix of real-world paper figures and procedurally generated diagrams. Recognizing that SVGs are composed of recurring primitives and hierarchical local structures, we design a coarse-to-fine training curriculum that begins with supervised fine-tuning (SFT) to learn atomic primitives and transitions to reinforcement learning (RL) refinement to optimize global diagram fidelity, layout consistency, and topological edge cases. Finally, we present VFIG-BENCH, a comprehensive evaluation suite with novel metrics designed to measure the structural integrity of complex figures. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, reaching a VLM-Judge score of 0.829 on VFIG-BENCH.
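To make the notion of "atomic primitives" concrete, the sketch below tallies the primitive elements of an SVG document using only the Python standard library. This is an illustrative aid, not VFIG's actual tokenization or any code from the paper; the element set and the `primitive_histogram` helper are our own assumptions about what counts as an atomic primitive.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SVG_NS = "http://www.w3.org/2000/svg"

# Basic SVG shape and text elements, treated here as the "atomic" vocabulary.
PRIMITIVES = {"rect", "circle", "ellipse", "line",
              "polyline", "polygon", "path", "text"}

def primitive_histogram(svg_text: str) -> Counter:
    """Count atomic SVG primitives in a document, ignoring
    namespaces and container elements such as <g> or <defs>."""
    root = ET.fromstring(svg_text)
    counts = Counter()
    for el in root.iter():
        tag = el.tag.split("}")[-1]  # strip the XML namespace prefix
        if tag in PRIMITIVES:
            counts[tag] += 1
    return counts

demo = f"""<svg xmlns="{SVG_NS}" width="100" height="60">
  <rect x="5" y="5" width="40" height="20"/>
  <rect x="55" y="5" width="40" height="20"/>
  <line x1="25" y1="25" x2="75" y2="45"/>
  <circle cx="75" cy="50" r="8"/>
</svg>"""

print(primitive_histogram(demo))
# Counter({'rect': 2, 'line': 1, 'circle': 1})
```

A diagram's histogram of primitives is exactly the kind of local structure SFT can learn per element, while global properties (how the two rectangles connect via the line) are what the RL refinement stage targets.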