Visual Scratchpads: Enabling Global Reasoning in Vision

📅 2024-10-10

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

career value

199K/year

🤖 AI Summary

Modern vision models excel at local feature recognition but exhibit poor generalization and low learning efficiency on tasks requiring global reasoning—such as maze pathfinding and image-grid relational modeling. To address this, we propose the *visual scratchpad*, a learnable mechanism that decomposes global problems into progressive spatial sub-steps and introduces a *degree of globality* metric to systematically evaluate model capability. Our key contributions are: (i) the first formalization of the visual scratchpad paradigm; (ii) theoretical and empirical demonstration that inductive scratchpad usage significantly improves zero-shot generalization and out-of-distribution robustness—especially for small models; and (iii) the construction of four novel global visual benchmarks, including maze navigation and connectivity reasoning tasks. Experiments show that our method boosts zero-shot accuracy by large margins, improves training efficiency by over 3×, and enhances out-of-distribution performance by 42%.

Technology Category

Application Category

📝 Abstract

Modern vision models have achieved remarkable success in benchmarks where local features provide critical information about the target. There is now a growing interest in solving tasks that require more global reasoning, where local features offer no significant information. These tasks are reminiscent of the connectivity tasks discussed by Minsky and Papert in 1969, which exposed the limitations of the perceptron model and contributed to the first AI winter. In this paper, we revisit such tasks by introducing four global visual benchmarks involving path findings and mazes. We show that: (1) although today's large vision models largely surpass the expressivity limitations of the early models, they still struggle with the learning efficiency; we put forward the"globality degree"notion to understand this limitation; (2) we then demonstrate that the picture changes and global reasoning becomes feasible with the introduction of"visual scratchpads"; similarly to the text scratchpads and chain-of-thoughts used in language models, visual scratchpads help break down global tasks into simpler ones; (3) we finally show that some scratchpads are better than others, in particular,"inductive scratchpads"that take steps relying on less information afford better out-of-distribution generalization and succeed for smaller model sizes.

Problem

Research questions and friction points this paper is trying to address.

Modern vision models struggle with global visual reasoning tasks

Multi-modal LLMs perform poorly on global visual datasets

Proposing chain-of-sketch to improve learning efficiency and generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-sketch breaks tasks into visual steps

Markovian structure improves CoS generalization

Inductive CoS enhances small model performance

🔎 Similar Papers

No similar papers found.