Symbolic Graphics Programming with Large Language Models

📅 2025-09-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates large language models' (LLMs) capability to generate executable, renderable symbolic graphics programs (SGPs)—i.e., syntactically correct and semantically accurate SVG code—from natural-language descriptions, thereby probing their cross-modal visual understanding. To address prevalent issues in generated outputs, including semantic inaccuracy, syntactic invalidity, and scene incoherence, the authors propose a reinforcement learning framework with verifiable rewards: (1) a format-validity gate ensures renderability by filtering out invalid SVG structures; and (2) a cross-modal alignment reward, grounded in strong vision encoders (SigLIP for text–image, DINO for image–image), provides verifiable text–image semantic evaluation. Applied to Qwen-2.5-7B, the method achieves state-of-the-art performance among open-source models on the SGP-GenBench benchmark, on par with top proprietary systems. Analysis of training dynamics shows that RL induces finer decomposition of objects into controllable primitives and contextual details that improve scene coherence.

📝 Abstract
Large language models (LLMs) excel at program synthesis, yet their ability to produce symbolic graphics programs (SGPs) that render into precise visual content remains underexplored. We study symbolic graphics programming, where the goal is to generate an SGP from a natural-language description. This task also serves as a lens into how LLMs understand the visual world by prompting them to generate images rendered from SGPs. Among various SGPs, our paper focuses on scalable vector graphics (SVGs). We begin by examining the extent to which LLMs can generate SGPs. To this end, we introduce SGP-GenBench, a comprehensive benchmark covering object fidelity, scene fidelity, and compositionality (attribute binding, spatial relations, numeracy). On SGP-GenBench, we discover that frontier proprietary models substantially outperform open-source models, and performance correlates well with general coding capabilities. Motivated by this gap, we aim to improve LLMs' ability to generate SGPs. We propose a reinforcement learning (RL) with verifiable rewards approach, where a format-validity gate ensures renderable SVG, and a cross-modal reward aligns text and the rendered image via strong vision encoders (e.g., SigLIP for text-image and DINO for image-image). Applied to Qwen-2.5-7B, our method substantially improves SVG generation quality and semantics, achieving performance on par with frontier systems. We further analyze training dynamics, showing that RL induces (i) finer decomposition of objects into controllable primitives and (ii) contextual details that improve scene coherence. Our results demonstrate that symbolic graphics programming offers a precise and interpretable lens on cross-modal grounding.
Problem

Research questions and friction points this paper is trying to address.

Generate symbolic graphics programs from natural language descriptions
Improve LLMs' ability to create precise scalable vector graphics
Bridge performance gap between proprietary and open-source models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning with verifiable rewards
Format-validity gate ensures renderable SVG
Cross-modal reward aligns text and image
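The gate-plus-reward design above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `text_image_sim` stands in for a precomputed SigLIP similarity score (the encoder and renderer calls are omitted), and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def svg_is_renderable(svg_code: str) -> bool:
    """Format-validity gate (sketch): accept only well-formed XML
    whose root element is <svg>. A real gate would additionally
    run an SVG renderer to confirm the output is drawable."""
    try:
        root = ET.fromstring(svg_code)
    except ET.ParseError:
        return False
    # Strip any XML namespace prefix, e.g. "{http://www.w3.org/2000/svg}svg".
    return root.tag.split("}")[-1] == "svg"

def verifiable_reward(svg_code: str, text_image_sim: float) -> float:
    """Verifiable reward (sketch): zero if the SVG fails the validity
    gate, otherwise the cross-modal alignment score (e.g. a SigLIP
    text-image similarity computed on the rendered SVG)."""
    return text_image_sim if svg_is_renderable(svg_code) else 0.0
```

The gating keeps the policy from collecting alignment reward for outputs that cannot be rendered at all, so the cross-modal score only ever applies to valid programs.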