🤖 AI Summary
To address the limitations of diffusion models in spatial precision, alignment, and symbolic structure when converting hand-drawn sketches into structured diagrams (e.g., flowcharts), this paper proposes a training-free agentic framework that couples a vision-language model (VLM) with large language models (LLMs). The method runs an iterative critique-generate-judge loop: a Critic VLM proposes a small set of qualitative, relational edits, multiple candidate LLMs synthesize SVG updates using diverse strategies, and a Judge VLM selects the best candidate. By relying on qualitative reasoning rather than brittle numerical estimates, the loop preserves global layout constraints such as alignment and connectivity, accurately composes complex primitives (e.g., multi-headed arrows), and supports human-in-the-loop corrections. Outputs are editable SVG programs that can be extended to presentation tools such as PowerPoint via APIs. On ten sketches derived from flowcharts in published papers, the approach reconstructs layout and structure more faithfully than GPT-5 and Gemini-2.5-Pro and does not insert unwanted text.
📝 Abstract
We study sketch-to-diagram generation: converting rough hand sketches into precise, compositional diagrams. Diffusion models excel at photorealism but struggle with the spatial precision, alignment, and symbolic structure required for flowcharts. We introduce See it. Say it. Sorted., a training-free agentic system that couples a Vision-Language Model (VLM) with Large Language Models (LLMs) to produce editable Scalable Vector Graphics (SVG) programs. The system runs an iterative loop in which a Critic VLM proposes a small set of qualitative, relational edits; multiple candidate LLMs synthesize SVG updates with diverse strategies (conservative->aggressive, alternative, focused); and a Judge VLM selects the best candidate, ensuring stable improvement. This design prioritizes qualitative reasoning over brittle numerical estimates, preserves global constraints (e.g., alignment, connectivity), and naturally supports human-in-the-loop corrections. On 10 sketches derived from flowcharts in published papers, our method more faithfully reconstructs layout and structure than two frontier closed-source image generation LLMs (GPT-5 and Gemini-2.5-Pro), accurately composing primitives (e.g., multi-headed arrows) without inserting unwanted text. Because outputs are programmatic SVGs, the approach is readily extensible to presentation tools (e.g., PowerPoint) via APIs and can be specialized with improved prompts and task-specific tools. The codebase is open-sourced at https://github.com/hantaoZhangrichard/see_it_say_it_sorted.git.
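The Critic, candidate-generation, and Judge loop described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`refine`, `toy_critic`, `toy_generate`, `toy_judge`), the iteration cap, and the choice to keep the current SVG among the candidates are all assumptions made for illustration, and the toy stubs stand in for real VLM and LLM calls.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces; a real system would wrap VLM/LLM API calls here.
Critic = Callable[[str], List[str]]                # svg -> qualitative edits
Generator = Callable[[str, List[str], str], str]   # (svg, edits, strategy) -> svg
Judge = Callable[[Sequence[str]], int]             # candidates -> index of best

# Strategy labels taken from the abstract.
STRATEGIES = ("conservative", "aggressive", "alternative", "focused")


def refine(svg: str, critic: Critic, generate: Generator, judge: Judge,
           max_iters: int = 5) -> str:
    """Iterate critique -> generate -> judge until the critic has no edits."""
    for _ in range(max_iters):
        edits = critic(svg)
        if not edits:
            break
        # Assumption: the current SVG stays in the pool, so the judge can
        # reject every update and the result never regresses.
        candidates = [svg] + [generate(svg, edits, s) for s in STRATEGIES]
        svg = candidates[judge(candidates)]
    return svg


# Toy stand-ins: the "sketch" is reduced to a list of required shapes.
TARGET = ["rect", "circle", "line"]


def toy_critic(svg: str) -> List[str]:
    missing = [t for t in TARGET if f"<{t}" not in svg]
    return [f"add a {t}" for t in missing[:2]]  # propose a small set of edits


def toy_generate(svg: str, edits: List[str], strategy: str) -> str:
    # The conservative strategy applies one edit; the others apply all.
    chosen = edits[:1] if strategy == "conservative" else edits
    body = "".join(f"<{e.split()[-1]}/>" for e in chosen)
    return svg.replace("</svg>", body + "</svg>")


def toy_judge(candidates: Sequence[str]) -> int:
    scores = [sum(f"<{t}" in c for t in TARGET) for c in candidates]
    return scores.index(max(scores))


start = '<svg xmlns="http://www.w3.org/2000/svg"></svg>'
result = refine(start, toy_critic, toy_generate, toy_judge)
print(result)
```

In this sketch the components only exchange SVG text and short natural-language edits, which mirrors the abstract's emphasis on qualitative, relational feedback; a human correction could be injected by appending to the list returned by the critic.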