Reliable Reasoning in SVG-LLMs via Multi-Task Multi-Reward Reinforcement Learning

📅 2026-03-17
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing methods for SVG generation suffer from limited generalization, redundant code output, and a lack of explicit reasoning mechanisms. To address these limitations, this work proposes CTRL-S, a novel framework that introduces chain-of-thought (CoT) reasoning into SVG generation for the first time, explicitly modeling structured decision pathways during the generation process. The authors construct SVG-Sophia, a high-quality multi-task dataset, and employ Group Relative Policy Optimization (GRPO) to jointly optimize multiple reward signals, including DINO visual features, image-text similarity, format compliance, and code efficiency. Experimental results demonstrate that CTRL-S significantly outperforms current approaches in task success rate, SVG code quality, and visual fidelity.
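The multi-reward design described above combines several per-sample signals into a single scalar reward. A minimal sketch of that combination step follows; the reward names, example values, and equal weighting are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of aggregating multiple reward signals for one
# generated SVG sample. Weights and reward names are assumptions.

def combine_rewards(rewards, weights=None):
    """Weighted sum of named reward terms for one SVG sample."""
    if weights is None:
        # Assumed equal weighting across the reward signals.
        weights = {name: 1.0 for name in rewards}
    return sum(weights[name] * value for name, value in rewards.items())

sample_rewards = {
    "dino_visual": 0.72,      # DINO feature similarity to the target image
    "image_text_sim": 0.65,   # image-text similarity score
    "format": 1.0,            # 1.0 if the SVG parses / follows the format
    "code_efficiency": 0.80,  # shorter, non-redundant SVG scores higher
}
total = combine_rewards(sample_rewards)  # -> 3.17
```

In practice the relative weights would be tuned so that no single signal (e.g. code brevity) dominates visual fidelity.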

📝 Abstract
With the rapid advancement of vision-language models, an increasing number of studies have explored their potential for SVG generation tasks. Although existing approaches improve performance by constructing large-scale SVG datasets and introducing SVG-specific tokens, they still suffer from limited generalization, redundant paths in code outputs, and a lack of explicit reasoning. In this work, we present CTRL-S (Chain-of-Thought Reinforcement Learning for SVG), a unified framework that introduces a chain-of-thought mechanism to explicitly expose the model's reasoning process during SVG generation. To support this structured reasoning, we construct SVG-Sophia, a high-quality dataset containing 145K samples across SVG code refinement, Text-to-SVG, and Image-to-SVG tasks. By training the model to generate group-level structured SVG code, CTRL-S significantly improves structural coherence and visual fidelity. Furthermore, we adopt the GRPO algorithm and design a multi-reward optimization framework, incorporating DINO, image-text similarity, format, and code efficiency rewards. Through joint multi-reward optimization and multi-task training, our approach systematically enhances overall generation capabilities. Extensive experiments show that CTRL-S outperforms existing methods, achieving higher task success rates, superior SVG code quality, and exceptional visual fidelity.
Problem

Research questions and friction points this paper is trying to address.

SVG generation
limited generalization
redundant paths
explicit reasoning
vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought Reasoning
Multi-Reward Reinforcement Learning
Structured SVG Generation
Multi-Task Training
GRPO Algorithm