InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

📅 2026-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models reason shallowly on complex tasks: they rely heavily on text-centric paradigms and struggle with the long-horizon, interleaved visual-textual reasoning typical of human thought. This work proposes InterSketch, an interleaved reasoning model with self-correction capabilities. It generates intermediate visual sketches with external tools and interleaves them with textual reasoning. In the cold-start stage, the model is trained on a synthesized, high-quality visual-textual chain-of-thought (VT-CoT) dataset that includes a reflection mechanism. In the reinforcement learning stage, a stepwise reward mitigates the reward sparsity of end-only supervision. The authors report improved perception and logical reasoning on long-horizon visual understanding tasks, with results on visual reasoning benchmarks that surpass proprietary models such as Gemini-3-Pro.
📝 Abstract
While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, limiting their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model that enhances VT-CoT capability through self-correction and stepwise reward mechanisms. InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning on long-horizon visual understanding tasks. Specifically, in the first, cold-start stage, we synthesize a high-quality interleaved VT-CoT dataset and incorporate a reflection mechanism so that the model acquires multi-turn interleaved reasoning and self-correction abilities. In the subsequent reinforcement learning (RL) stage, we design a stepwise reward mechanism to mitigate the sparsity of reward signals inherent in end-only supervision over long-horizon reasoning. Extensive experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, which even outperforms proprietary models such as Gemini-3-Pro.
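The abstract does not give the reward formula, so the following is only a minimal sketch of why stepwise rewards can ease the sparsity of end-only supervision. It is not the authors' implementation. The function names, the `step_weight` mixing coefficient, and the per-step scores are all hypothetical. With end-only supervision every step in a trajectory receives the same return, whereas adding per-step scores lets the return differ from step to step.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return-to-go for each step: G_t = r_t + gamma * G_{t+1}."""
    returns: List[float] = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def end_only_returns(num_steps: int, final_reward: float,
                     gamma: float = 1.0) -> List[float]:
    """Only the final answer is rewarded; all earlier steps get zero reward."""
    if num_steps < 1:
        raise ValueError("a trajectory needs at least one step")
    rewards = [0.0] * num_steps
    rewards[-1] = final_reward
    return discounted_returns(rewards, gamma)


def stepwise_returns(step_scores: List[float], final_reward: float,
                     gamma: float = 1.0,
                     step_weight: float = 0.5) -> List[float]:
    """Hypothetical mix: weighted per-step scores plus the final reward.

    step_scores and step_weight are illustrative assumptions, not values
    or definitions taken from the paper.
    """
    if not step_scores:
        raise ValueError("a trajectory needs at least one step")
    rewards = [step_weight * score for score in step_scores]
    rewards[-1] += final_reward
    return discounted_returns(rewards, gamma)


if __name__ == "__main__":
    # Three reasoning steps; the middle one (say, a poor sketch) scores 0.
    print(end_only_returns(3, final_reward=1.0))
    print(stepwise_returns([1.0, 0.0, 1.0], final_reward=1.0))
```

In this toy trajectory the end-only returns are identical for all three steps, so the learner cannot tell which step helped. The stepwise variant gives the first step a higher return than the two that follow it, which is the kind of denser signal the abstract alludes to.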
Problem

Research questions and friction points this paper is trying to address.

visual reasoning
vision-language models
chain-of-thought
long-horizon reasoning
visual-textual interleaving
Innovation

Methods, ideas, or system contributions that make the work stand out.

interleaved reasoning
visual sketch
self-correcting mechanism
stepwise reward
vision-language models