See it. Say it. Sorted: Agentic System for Compositional Diagram Generation

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address diffusion models’ limitations in spatial precision, symbol alignment, and compositional semantics when converting hand-drawn sketches into structured diagrams (e.g., flowcharts), this paper proposes a training-free vision-language collaborative agent framework. The method integrates a vision-language model (VLM) and a large language model (LLM) in a “critique-generate-discriminate” iterative loop, using qualitative reasoning to preserve global layout constraints. It supports multi-strategy SVG code generation, precise reconstruction of complex primitives (e.g., multi-headed arrows), and human-feedback-driven interactive editing, all without numerical optimization. Outputs are editable vector graphics directly compatible with tools such as PowerPoint. Evaluated on ten real-world flowchart sketches sourced from academic papers, the approach significantly outperforms GPT-5 and Gemini-2.5-Pro, achieving zero redundant text and high-fidelity structural reconstruction.

📝 Abstract
We study sketch-to-diagram generation: converting rough hand sketches into precise, compositional diagrams. Diffusion models excel at photorealism but struggle with the spatial precision, alignment, and symbolic structure required for flowcharts. We introduce See it. Say it. Sorted., a training-free agentic system that couples a Vision-Language Model (VLM) with Large Language Models (LLMs) to produce editable Scalable Vector Graphics (SVG) programs. The system runs an iterative loop in which a Critic VLM proposes a small set of qualitative, relational edits; multiple candidate LLMs synthesize SVG updates with diverse strategies (conservative → aggressive, alternative, focused); and a Judge VLM selects the best candidate, ensuring stable improvement. This design prioritizes qualitative reasoning over brittle numerical estimates, preserves global constraints (e.g., alignment, connectivity), and naturally supports human-in-the-loop corrections. On 10 sketches derived from flowcharts in published papers, our method more faithfully reconstructs layout and structure than two frontier closed-source image generation LLMs (GPT-5 and Gemini-2.5-Pro), accurately composing primitives (e.g., multi-headed arrows) without inserting unwanted text. Because outputs are programmatic SVGs, the approach is readily extensible to presentation tools (e.g., PowerPoint) via APIs and can be specialized with improved prompts and task-specific tools. The codebase is open-sourced at https://github.com/hantaoZhangrichard/see_it_say_it_sorted.git.
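The abstract's critique-generate-discriminate loop can be sketched as follows. All function bodies here are illustrative stubs standing in for VLM/LLM calls, and the names (`critic`, `synthesize`, `judge`, `refine`) are assumptions for exposition, not the API of the open-sourced codebase:

```python
# Hypothetical sketch of the Critic -> candidate LLMs -> Judge loop.
# Model calls are replaced by simple rules so the control flow runs end to end.

STRATEGIES = ["conservative", "aggressive", "alternative", "focused"]

def critic(sketch, svg):
    """Critic VLM: propose a small set of qualitative, relational edits."""
    # Stub: flag a missing connector instead of analyzing the images.
    return ["add arrow from Start to Process"] if "arrow" not in svg else []

def synthesize(svg, edits, strategy):
    """Candidate LLM: apply the edits to the SVG under one strategy."""
    # Stub: append a line element tagged with the strategy name.
    return svg + f'<line class="arrow {strategy}" x1="0" y1="0" x2="10" y2="0"/>'

def judge(sketch, candidates):
    """Judge VLM: select the candidate that best matches the sketch."""
    # Stub: prefer the most conservative candidate (stable improvement).
    return min(candidates, key=lambda c: STRATEGIES.index(c[0]))[1]

def refine(sketch, svg, max_rounds=3):
    """Iterate until the Critic has no further qualitative edits."""
    for _ in range(max_rounds):
        edits = critic(sketch, svg)
        if not edits:  # converged
            break
        candidates = [(s, synthesize(svg, edits, s)) for s in STRATEGIES]
        svg = judge(sketch, candidates)
    return svg

result = refine("rough flowchart sketch", "<svg></svg>")
```

Because every round keeps only the Judge's pick, the working SVG improves monotonically in the Judge's estimation; human-in-the-loop corrections would enter the same way, as extra edits fed to the candidate generators.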
Problem

Research questions and friction points this paper is trying to address.

Converting rough hand sketches into precise compositional diagrams
Addressing spatial precision and symbolic structure in flowchart generation
Producing editable SVG programs through iterative agentic systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free agentic system coupling a VLM with LLMs
Iterative loop with Critic and Judge VLMs for stable improvement
Editable SVG programs generated from rough sketches
Hantao Zhang
Department of Statistics and Data Science, Yale University, New Haven, CT 06510
Jingyang Liu
School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB
Ed Li
Yale University
agentic systems · ai4science · autoML