SVGCraft: Beyond Single Object Text-to-SVG Synthesis with Comprehensive Canvas Layout

📅 2024-03-30
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing text-to-SVG methods are largely limited to single-object synthesis and struggle to generate multi-element vector scenes with controllable layout. This paper introduces the first end-to-end text-to-multi-object vector scene generation framework. Our method addresses this challenge through three core innovations: (1) text-driven fine-grained canvas layout planning; (2) mask-constrained latent-space localization and attention-based fusion; and (3) a canvas completion strategy grounded in primitive geometric shapes. The framework integrates a pretrained large language model (LLM), a diffusion-based U-Net, masked latent generation, and joint LPIPS+opacity optimization. Quantitative evaluation demonstrates state-of-the-art performance across abstraction, recognizability, and detail fidelity: CLIP-T score of 0.4563, cosine similarity of 0.6342, and aesthetic score of 6.7832—surpassing all prior approaches.
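The masked latent generation described above can be sketched minimally: each object gets a binary mask from its bounding box, and the object's latent is blended into the scene latent only inside that box. This is a hedged illustration with NumPy arrays standing in for diffusion latents; the function names (`box_mask`, `place_objects`) and the plain masked blend are assumptions, not the paper's exact implementation.

```python
import numpy as np

def box_mask(h, w, box):
    """Binary mask for a bounding box (x0, y0, x1, y1) given in [0, 1] canvas coords."""
    m = np.zeros((h, w), dtype=np.float32)
    x0, y0, x1, y1 = box
    m[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return m

def place_objects(scene_latent, object_latents, boxes):
    """Blend each object's latent into the scene inside its box (masked latents).

    scene_latent: (C, H, W) array; object_latents: list of (C, H, W) arrays.
    """
    _, h, w = scene_latent.shape
    out = scene_latent.copy()
    for z, box in zip(object_latents, boxes):
        m = box_mask(h, w, box)          # (H, W), broadcasts over channels
        out = out * (1.0 - m) + z * m    # keep scene outside the box, object inside
    return out
```

In the actual framework this masking happens in the diffusion U-Net's latent space and is combined with attention-map fusion; the sketch only shows the spatial-placement idea.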
📝 Abstract
Generating VectorArt from text prompts is a challenging vision task, requiring diverse yet realistic depictions of the seen as well as unseen entities. However, existing research has been mostly limited to the generation of single objects, rather than comprehensive scenes comprising multiple elements. In response, this work introduces SVGCraft, a novel end-to-end framework for the creation of vector graphics depicting entire scenes from textual descriptions. Utilizing a pre-trained LLM for layout generation from text prompts, this framework introduces a technique for producing masked latents in specified bounding boxes for accurate object placement. It introduces a fusion mechanism for integrating attention maps and employs a diffusion U-Net for coherent composition, speeding up the drawing process. The resulting SVG is optimized using a pre-trained encoder and LPIPS loss with opacity modulation to maximize similarity. Additionally, this work explores the potential of primitive shapes in facilitating canvas completion in constrained environments. Through both qualitative and quantitative assessments, SVGCraft is demonstrated to surpass prior works in abstraction, recognizability, and detail, as evidenced by its performance metrics (CLIP-T: 0.4563, Cosine Similarity: 0.6342, Confusion: 0.66, Aesthetic: 6.7832). The code will be available at https://github.com/ayanban011/SVGCraft.
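The joint objective from the abstract, LPIPS loss with opacity modulation, can be sketched as a perceptual term plus a penalty on intermediate opacities. This is a minimal NumPy illustration: `perceptual_proxy` is a mean-squared-error stand-in for the real learned LPIPS metric, and the `o * (1 - o)` modulation term (pushing stroke opacities toward 0 or 1) is an assumed form, not the paper's exact regularizer.

```python
import numpy as np

def perceptual_proxy(render, target):
    """Stand-in for LPIPS: plain MSE between rendered SVG and target image."""
    return float(np.mean((render - target) ** 2))

def total_loss(render, target, opacities, lam=0.01):
    """Joint objective: perceptual similarity plus an opacity-modulation penalty.

    The penalty o * (1 - o) is zero for fully transparent or fully opaque
    strokes, discouraging half-visible primitives (hypothetical form).
    """
    opacity_term = float(np.mean(opacities * (1.0 - opacities)))
    return perceptual_proxy(render, target) + lam * opacity_term
```

In practice the render comes from a differentiable SVG rasterizer and the loss is minimized over path and opacity parameters by gradient descent; the sketch only fixes the shape of the objective.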
Problem

Research questions and friction points this paper is trying to address.

Generating comprehensive vector scenes from text prompts
Overcoming single-object limitation in text-to-SVG synthesis
Creating accurate multi-element layouts with proper object placement
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM generates layouts from text prompts
Masked latents ensure accurate object placement
Diffusion U-Net composes coherent scenes while speeding up drawing
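The first innovation, LLM-driven layout planning, ultimately produces per-object bounding boxes for the canvas. A minimal sketch of consuming such a plan: parse a hypothetical JSON reply (the `{"objects": [{"name": ..., "box": ...}]}` schema is an assumption, not the paper's prompt format) and clamp each box to the unit canvas.

```python
import json

def parse_layout(llm_reply):
    """Parse a hypothetical LLM layout reply into (name, box) pairs.

    Expected (assumed) schema:
    {"objects": [{"name": "cat", "box": [x0, y0, x1, y1]}, ...]}
    Boxes are clamped to the [0, 1] unit canvas.
    """
    layout = json.loads(llm_reply)
    parsed = []
    for obj in layout["objects"]:
        box = [min(max(v, 0.0), 1.0) for v in obj["box"]]
        parsed.append((obj["name"], box))
    return parsed
```

Each parsed box would then drive the masked-latent placement step, tying the LLM's plan to the diffusion stage.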
Ayan Banerjee
Computer Vision Center, Universitat Autònoma de Barcelona, Spain
Nityanand Mathur
Data Scientist at smallest.ai
Josep Lladós
Computer Vision Center, Universitat Autònoma de Barcelona
Umapada Pal
CVPR Unit, Indian Statistical Institute Kolkata, India
Anjan Dutta
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey