Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Converting static raster documents (e.g., slides) into structured vector formats such as SVG typically loses high-level semantic structure and blurs the separation between textual and graphical elements. Method: This paper proposes SliDer, the first framework to employ vision-language models (VLMs) for semantic-level document derendering. It leverages multimodal understanding to detect document elements, parse their semantic and layout attributes, and iteratively generate editable SVG code. To support this task, the authors introduce Slide2SVG, the first annotated dataset designed for slide-to-SVG conversion, and evaluate reconstruction quality with perceptual metrics such as LPIPS. Contribution/Results: SliDer achieves a reconstruction LPIPS of 0.069 and is preferred by human evaluators over the strongest zero-shot VLM baseline in 82.9% of cases, significantly improving the structural fidelity, visual quality, and post-conversion editability of the generated SVGs.
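The detect-then-assemble stage described above can be pictured as turning a list of parsed elements into a single editable SVG document. The sketch below is illustrative only: the element schema (`type`, bounding-box fields, `content`, `href`) and the function name are hypothetical stand-ins, not SliDer's actual output format.

```python
import xml.etree.ElementTree as ET

# Hypothetical parsed elements, as a detection stage might emit them:
# each has a type, position/size attributes, and semantic content.
elements = [
    {"type": "text", "x": 40, "y": 60, "size": 32, "content": "Semantic Derendering"},
    {"type": "image", "x": 40, "y": 100, "w": 320, "h": 180, "href": "figure1.png"},
]

def elements_to_svg(elements, width=960, height=540):
    """Assemble detected text and image elements into one editable SVG tree."""
    svg = ET.Element("svg", xmlns="http://www.w3.org/2000/svg",
                     width=str(width), height=str(height))
    for el in elements:
        if el["type"] == "text":
            node = ET.SubElement(svg, "text", x=str(el["x"]), y=str(el["y"]),
                                 attrib={"font-size": str(el["size"])})
            node.text = el["content"]
        elif el["type"] == "image":
            ET.SubElement(svg, "image", x=str(el["x"]), y=str(el["y"]),
                          width=str(el["w"]), height=str(el["h"]),
                          href=el["href"])
    return ET.tostring(svg, encoding="unicode")

print(elements_to_svg(elements))
```

Keeping text and image elements as distinct SVG nodes, rather than flattening everything to paths, is exactly what preserves post-conversion editability.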

📝 Abstract
Multimedia documents such as slide presentations and posters are designed to be interactive and easy to modify. Yet, they are often distributed in a static raster format, which limits editing and customization. Restoring their editability requires converting these raster images back into structured vector formats. However, existing geometric raster-vectorization methods, which rely on low-level primitives like curves and polygons, fall short at this task. Specifically, when applied to complex documents like slides, they fail to preserve the high-level structure, resulting in a flat collection of shapes where the semantic distinction between image and text elements is lost. To overcome this limitation, we address the problem of semantic document derendering by introducing SliDer, a novel framework that uses Vision-Language Models (VLMs) to derender slide images as compact and editable Scalable Vector Graphic (SVG) representations. SliDer detects and extracts attributes from individual image and text elements in a raster input and organizes them into a coherent SVG format. Crucially, the model iteratively refines its predictions during inference in a process analogous to human design, generating SVG code that more faithfully reconstructs the original raster upon rendering. Furthermore, we introduce Slide2SVG, a novel dataset comprising raster-SVG pairs of slide documents curated from real-world scientific presentations, to facilitate future research in this domain. Our results demonstrate that SliDer achieves a reconstruction LPIPS of 0.069 and is favored by human evaluators in 82.9% of cases compared to the strongest zero-shot VLM baseline.
Problem

Research questions and friction points this paper is trying to address.

Converting static raster slide images into editable vector SVG formats
Preserving semantic structure between image and text elements during conversion
Overcoming limitations of geometric methods that lose high-level document structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Vision-Language Models to derender raster slides into SVG
Iteratively refines predictions during inference, analogous to a human design loop
Detects individual image and text elements and organizes them into a coherent SVG format