Beyond Static Visual Tokens: Structured Sequential Visual Chain-of-Thought Reasoning

📅 2026-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing vision-language models: they encode images as static prefixes and therefore struggle to support goal-driven, adaptive access to visual information. To overcome this, the authors propose Structured Sequential Visual Chain-of-Thought (SSV-CoT), a framework that emulates human visual attention. It uses a question-guided saliency map to identify and organize salient image regions, then performs structured, sequential multimodal reasoning over them in order of perceptual and semantic priority. The method is trained end-to-end with textual chain-of-thought and answer supervision, without region-level annotations or external tools. This lets it model both the spatial distribution of visual importance and the progressive semantic refinement from primary to secondary cues. Experiments on multiple visual reasoning benchmarks show gains, supporting the proposed structured visual cognition mechanism.
📝 Abstract
Current multimodal LLMs encode images as static visual prefixes and rely on text-based reasoning, lacking goal-driven and adaptive visual access. Inspired by human visual perception, where attention is selectively and sequentially shifted from the most informative regions to secondary cues, we propose Structured Sequential Visual CoT (SSV-CoT). First, a question-relevant saliency map identifies and organizes key visual regions, explicitly modeling the spatial distribution of visual importance. Second, reasoning is performed following this discriminative order, inducing a curriculum-like semantic progression from primary to secondary cues. The method is trained end-to-end, using text CoT and answer supervision, without relying on region-level annotations or specialized external tools. Experiments on diverse visual reasoning benchmarks show gains, validating structured and sequential visual cognition.
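The two steps described in the abstract (score regions with a question-relevant saliency map, then visit them from most to least salient) can be illustrated with a toy sketch. This is not the authors' implementation: the fixed grid partition, mean pooling, the function name `order_regions_by_saliency`, and the saliency values are all assumptions made for illustration, and in the paper the saliency map is learned end-to-end rather than given.

```python
def order_regions_by_saliency(saliency, grid=2, top_k=3):
    """Split a 2-D saliency map into grid x grid cells and return the
    top_k cells, sorted from highest to lowest mean saliency.

    Hypothetical helper: the grid partition and mean pooling are
    illustrative assumptions, not details taken from the paper.
    """
    height, width = len(saliency), len(saliency[0])
    cell_h, cell_w = height // grid, width // grid
    scored = []
    for r in range(grid):
        for c in range(grid):
            values = [
                saliency[y][x]
                for y in range(r * cell_h, (r + 1) * cell_h)
                for x in range(c * cell_w, (c + 1) * cell_w)
            ]
            scored.append(((r, c), sum(values) / len(values)))
    # Primary cues first, secondary cues later.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy 4x4 question-conditioned saliency map (made-up values).
saliency = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.6, 0.0, 0.1],
    [0.2, 0.1, 0.5, 0.4],
    [0.0, 0.1, 0.3, 0.4],
]

ordered = order_regions_by_saliency(saliency, grid=2, top_k=3)
for step, (cell, score) in enumerate(ordered, start=1):
    print(f"step {step}: attend to cell {cell} (mean saliency {score:.2f})")
```

With this toy map the top-left cell is visited first, then the bottom-right, then the bottom-left. In a sequential visual chain-of-thought setup, each reasoning step would condition on the visual features of the region visited at that step instead of on a single static image prefix.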
Problem

Research questions and friction points this paper is trying to address.

multimodal LLMs
visual reasoning
static visual tokens
adaptive visual access
sequential attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequential Visual Reasoning
Saliency Map
Chain-of-Thought
Multimodal LLMs
Structured Visual Cognition