🤖 AI Summary
This work addresses a limitation of existing vision-language models: they answer questions about images in text only, which is hard for users to verify. The authors propose SketchVLM, a training-free, model-agnostic framework that lets such models draw editable, non-destructive SVG overlays (drawings and annotations) on the input image to visually explain their answers. Single-turn generation already achieves strong accuracy and annotation quality, and multi-turn generation opens further opportunities for human-AI collaboration. Across seven benchmarks spanning visual reasoning and drawing, the method improves visual reasoning accuracy by up to 28.5 percentage points and annotation quality by up to 1.48× relative to image-editing and fine-tuned sketching baselines, and its annotations are more faithful to the model's stated answer.
📝 Abstract
When answering questions about images, humans naturally point, label, and draw to explain their reasoning. In contrast, modern vision-language models (VLMs) such as Gemini-3-Pro and GPT-5 only respond with text, which can be difficult for users to verify. We present SketchVLM, a training-free, model-agnostic framework that enables VLMs to produce non-destructive, editable SVG overlays on the input image to visually explain their answers. Across seven benchmarks spanning visual reasoning (maze navigation, ball-drop trajectory prediction, and object counting) and drawing (part labeling, connecting-the-dots, and drawing shapes around objects), SketchVLM improves visual reasoning task accuracy by up to +28.5 percentage points and annotation quality by up to 1.48x relative to image-editing and fine-tuned sketching baselines, while also producing annotations that are more faithful to the model's stated answer. We find that single-turn generation already achieves strong accuracy and annotation quality, and multi-turn generation opens up further opportunities for human-AI collaboration. An interactive demo and code are at https://sketchvlm.github.io/.
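To make the idea of a non-destructive, editable overlay concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function name `build_overlay_svg` and the shape dictionary schema (`kind`, `cx`, `points`, `text`, and so on) are hypothetical, chosen only for illustration. The sketch assumes the annotations (for example, ones a VLM might propose) are already available as simple shape records, and it layers them as vector elements over a referenced image without touching the image pixels.

```python
from xml.sax.saxutils import escape, quoteattr


def build_overlay_svg(image_href, width, height, shapes):
    """Return an SVG string that layers vector annotations over an image.

    The raster image is only referenced, never modified, so every annotation
    remains a separate element that can be edited or removed later.
    The shape schema used here is hypothetical.
    """
    parts = [
        '<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}" viewBox="0 0 {width} {height}">',
        # Bottom layer: the untouched input image.
        f'<image href={quoteattr(image_href)} width="{width}" height="{height}"/>',
        # Top layer: all annotations grouped so they can be toggled together.
        '<g id="annotations" fill="none" stroke="red" stroke-width="3">',
    ]
    for shape in shapes:
        kind = shape["kind"]
        if kind == "circle":
            parts.append(
                f'<circle cx="{shape["cx"]}" cy="{shape["cy"]}" r="{shape["r"]}"/>'
            )
        elif kind == "path":
            # A trajectory or maze route drawn as connected line segments.
            points = " ".join(f"{x},{y}" for x, y in shape["points"])
            parts.append(f'<polyline points="{points}"/>')
        elif kind == "label":
            parts.append(
                f'<text x="{shape["x"]}" y="{shape["y"]}" fill="red" stroke="none">'
                f'{escape(shape["text"])}</text>'
            )
        else:
            raise ValueError(f"unsupported shape kind: {kind!r}")
    parts.append("</g>")
    parts.append("</svg>")
    return "\n".join(parts)
```

Because each annotation is its own SVG element inside the `annotations` group, a user or a later model turn could move, restyle, or delete a single mark without regenerating the image, which is the property the abstract describes as non-destructive and editable.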