TikZero: Zero-Shot Text-Guided Graphics Program Synthesis

📅 2025-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses zero-shot generation of TikZ graphics programs from natural-language captions without requiring paired caption-code training data, targeting figures that are both geometrically precise and editable. Because aligned data is scarce while unaligned graphics programs and captioned raster images are plentiful, TikZero decouples program generation from text understanding, using image representations as an intermediary bridge: the program generator is trained on graphics programs alone, the text side is trained on captioned images alone, and at inference a caption is mapped into the image-embedding space to condition zero-shot code synthesis. The method substantially outperforms baselines that can only be trained on caption-aligned graphics programs. When caption-aligned examples are additionally used as a complementary training signal, it matches or exceeds much larger models, including commercial systems such as GPT-4o, in generating high-fidelity, human-editable vector graphics programmatically.

📝 Abstract
With the rise of generative AI, synthesizing figures from text captions becomes a compelling application. However, achieving high geometric precision and editability requires representing figures as graphics programs in languages like TikZ, and aligned training data (i.e., graphics programs with captions) remains scarce. Meanwhile, large amounts of unaligned graphics programs and captioned raster images are more readily available. We reconcile these disparate data sources by presenting TikZero, which decouples graphics program generation from text understanding by using image representations as an intermediary bridge. It enables independent training on graphics programs and captioned images and allows for zero-shot text-guided graphics program synthesis during inference. We show that our method substantially outperforms baselines that can only operate with caption-aligned graphics programs. Furthermore, when leveraging caption-aligned graphics programs as a complementary training signal, TikZero matches or exceeds the performance of much larger models, including commercial systems like GPT-4o. Our code, datasets, and select models are publicly available.
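The decoupling described above can be sketched as follows. This is a minimal toy illustration, not the paper's actual architecture: the function names, the hash-based "embedding", and the rule-based "generator" are all assumptions standing in for the trained neural components (a text-to-image-embedding adapter and an image-conditioned program generator).

```python
import hashlib
import math

EMB_DIM = 4  # toy dimensionality of the shared image-embedding space


def text_to_image_embedding(caption: str) -> list[float]:
    """Stand-in for the adapter trained only on captioned raster images:
    maps a caption into the image-embedding space. Here a deterministic
    hash replaces a learned text encoder."""
    digest = hashlib.sha256(caption.encode()).digest()
    return [math.tanh(b / 128.0 - 1.0) for b in digest[:EMB_DIM]]


def program_generator(img_emb: list[float]) -> str:
    """Stand-in for the generator trained only on unaligned graphics
    programs: conditions TikZ synthesis on an image embedding. A trivial
    rule replaces a neural decoder."""
    shape = "circle (1)" if sum(img_emb) > 0 else "rectangle (2,1)"
    return f"\\draw (0,0) {shape};"


# Zero-shot inference: the caption is bridged through the image-embedding
# space, so no caption-code pairs are ever required for training.
caption = "a unit circle centered at the origin"
tikz = program_generator(text_to_image_embedding(caption))
print(tikz)
```

The key design point is that the two components never see aligned data: each trains against the image-embedding space independently, and that shared space is what connects them at inference time.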
Problem

Research questions and friction points this paper is trying to address.

Synthesizing high-precision, editable graphics from text captions.
Addressing scarcity of aligned graphics programs and captions.
Enabling zero-shot text-guided graphics program synthesis.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses image representations as an intermediary bridge between text and code
Enables zero-shot text-guided graphics program synthesis
Outperforms baselines restricted to caption-aligned graphics programs