See it. Say it. Sorted: Agentic System for Compositional Diagram Generation

📅 2025-08-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Diffusion models lack the spatial precision, symbol alignment, and compositional structure needed to convert hand-drawn sketches into structured diagrams such as flowcharts. To address this, the paper proposes a training-free agentic framework that pairs a vision-language model (VLM) with large language models (LLMs) in an iterative critique-generate-judge loop. The loop relies on qualitative, relational reasoning rather than brittle numerical estimates, which helps preserve global layout constraints such as alignment and connectivity. It generates SVG code under several strategies, composes complex primitives (e.g., multi-headed arrows) accurately, and supports interactive editing driven by human feedback. Outputs are editable vector graphics that can be extended to tools such as PowerPoint via APIs. On 10 sketches derived from flowcharts in published papers, the method reconstructs layout and structure more faithfully than GPT-5 and Gemini-2.5-Pro and does not insert unwanted text.

📝 Abstract

We study sketch-to-diagram generation: converting rough hand sketches into precise, compositional diagrams. Diffusion models excel at photorealism but struggle with the spatial precision, alignment, and symbolic structure required for flowcharts. We introduce See it. Say it. Sorted., a training-free agentic system that couples a Vision-Language Model (VLM) with Large Language Models (LLMs) to produce editable Scalable Vector Graphics (SVG) programs. The system runs an iterative loop in which a Critic VLM proposes a small set of qualitative, relational edits; multiple candidate LLMs synthesize SVG updates with diverse strategies (conservative->aggressive, alternative, focused); and a Judge VLM selects the best candidate, ensuring stable improvement. This design prioritizes qualitative reasoning over brittle numerical estimates, preserves global constraints (e.g., alignment, connectivity), and naturally supports human-in-the-loop corrections. On 10 sketches derived from flowcharts in published papers, our method more faithfully reconstructs layout and structure than two frontier closed-source image generation LLMs (GPT-5 and Gemini-2.5-Pro), accurately composing primitives (e.g., multi-headed arrows) without inserting unwanted text. Because outputs are programmatic SVGs, the approach is readily extensible to presentation tools (e.g., PowerPoint) via APIs and can be specialized with improved prompts and task-specific tools. The codebase is open-sourced at https://github.com/hantaoZhangrichard/see_it_say_it_sorted.git.
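The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `refine`, the callable types `Critic`, `Generator`, and `Judge`, the `max_iters` parameter, and the stopping rule are all hypothetical names and choices made for this example. Keeping the current SVG among the candidates is likewise an assumption, included here as one simple way to make sure an iteration never makes the result worse.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces; the real system calls VLM/LLM APIs behind these roles.
Critic = Callable[[str, str], List[str]]          # (sketch, current_svg) -> qualitative edits
Generator = Callable[[str, List[str], str], str]  # (current_svg, edits, strategy) -> candidate SVG
Judge = Callable[[str, Sequence[str]], int]       # (sketch, candidates) -> index of best candidate

# Strategy names taken from the abstract.
STRATEGIES = ("conservative", "aggressive", "alternative", "focused")


def refine(sketch: str, initial_svg: str, critic: Critic,
           generator: Generator, judge: Judge, max_iters: int = 5) -> str:
    """Run a critic -> candidates -> judge loop and return the final SVG."""
    current = initial_svg
    for _ in range(max_iters):
        # The critic proposes a small set of relational edits, e.g.
        # "move the decision box below the start box".
        edits = critic(sketch, current)
        if not edits:
            break  # assumption: no proposed edits means the diagram is accepted
        # One candidate per strategy, plus the current SVG as a fallback
        # (assumption) so the judge can reject every proposed update.
        candidates = [current] + [generator(current, edits, s) for s in STRATEGIES]
        current = candidates[judge(sketch, candidates)]
    return current
```

Human-in-the-loop correction fits this shape naturally: a person can stand in for the critic by supplying the edit list directly.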

Problem

Research questions and friction points this paper is trying to address.

Converting rough hand sketches into precise compositional diagrams

Addressing spatial precision and symbolic structure in flowchart generation

Producing editable SVG programs through iterative agentic systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic system combines VLM and LLMs

Iterative loop with critic and judge VLMs

Generates editable SVG programs from sketches
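To make "editable SVG program" concrete, here is a small sketch of how a multi-headed arrow might be expressed as SVG: several lines share one start point and reuse a single arrowhead marker. The helper `multi_head_arrow_svg`, the marker id `head`, and the canvas size are hypothetical choices for illustration and are not taken from the paper's codebase.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def multi_head_arrow_svg(source, targets):
    """Return an SVG document with one arrow per target, all starting at `source`."""
    lines = "\n".join(
        f'  <line x1="{source[0]}" y1="{source[1]}" x2="{x}" y2="{y}" '
        f'stroke="black" marker-end="url(#head)"/>'
        for x, y in targets
    )
    return (
        f'<svg xmlns="{SVG_NS}" width="200" height="120">\n'
        '  <defs>\n'
        # A single arrowhead definition, reused by every line via marker-end.
        '    <marker id="head" markerWidth="8" markerHeight="8" refX="8" refY="4" orient="auto">\n'
        '      <path d="M0,0 L8,4 L0,8 Z"/>\n'
        '    </marker>\n'
        '  </defs>\n'
        f'{lines}\n'
        '</svg>'
    )


svg = multi_head_arrow_svg((20, 60), [(180, 20), (180, 100)])
root = ET.fromstring(svg)  # parses only if the markup is well-formed XML
```

Because every element is plain text with explicit coordinates, a later edit such as moving one arrowhead is a change to two attributes rather than a regenerated image.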
