Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought

📅 2026-02-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing visual reasoning approaches are hindered by modality misalignment, semantic fragmentation, and insufficient localization accuracy. This work proposes the Numerical Visual Chain-of-Thought (NV-CoT) framework, which extends the action space of multimodal large language models from discrete vocabularies to continuous Euclidean space, enabling direct generation of fine-grained bounding box coordinates via Gaussian (or Laplacian) policies. Requiring only minimal architectural modifications, NV-CoT is compatible with both supervised fine-tuning and reinforcement learning paradigms such as GRPO, and leverages reparameterized sampling for end-to-end optimization. Evaluated on three benchmarks, the method substantially outperforms eight strong baselines, achieving notable improvements in localization precision, answer accuracy, and training convergence speed.
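The summary's central mechanism, predicting bounding-box coordinates through a reparameterized Gaussian policy instead of discrete vocabulary tokens, can be sketched roughly as follows. This is a minimal illustration under our own assumptions (the head name, hidden size, sigmoid/clamp choices are ours), not the paper's released implementation.

```python
import torch
import torch.nn as nn

class GaussianBoxHead(nn.Module):
    """Illustrative sketch: map an MLLM hidden state to a Gaussian policy over
    normalized box coordinates (x1, y1, x2, y2) and sample them with the
    reparameterization trick so gradients flow through the sampled action."""

    def __init__(self, hidden_dim: int = 4096, num_coords: int = 4):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, num_coords)     # coordinate means
        self.log_std = nn.Linear(hidden_dim, num_coords)  # log standard deviations

    def forward(self, h: torch.Tensor):
        mu = torch.sigmoid(self.mean(h))              # keep means inside [0, 1]
        std = self.log_std(h).clamp(-5.0, 2.0).exp()  # avoid degenerate scales
        eps = torch.randn_like(mu)                    # reparameterized noise
        box = (mu + std * eps).clamp(0.0, 1.0)        # stochastic, differentiable w.r.t. mu and std
        return box, mu, std
```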

📝 Abstract
Recent multimodal large language models (MLLMs) increasingly rely on visual chain-of-thought to perform region-grounded reasoning over images. However, existing approaches ground regions via either textified coordinates, which cause modality mismatch and semantic fragmentation, or fixed-granularity patches, which both limit precise region selection and often require non-trivial architectural changes. In this paper, we propose Numerical Visual Chain-of-Thought (NV-CoT), a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space, allowing models to directly generate bounding-box coordinates as actions with only minimal architectural modification. The framework supports both supervised fine-tuning and reinforcement learning. In particular, we replace categorical token policies with a Gaussian (or Laplace) policy over coordinates and introduce stochasticity via reparameterized sampling, making NV-CoT fully compatible with GRPO-style policy optimization. Extensive experiments on three benchmarks against eight representative visual reasoning baselines demonstrate that NV-CoT significantly improves localization precision and final answer accuracy, while also accelerating training convergence, validating the effectiveness of continuous-action visual reasoning in MLLMs. The code is available at https://github.com/kesenzhao/NV-CoT.
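Since the abstract states that GRPO-style policy optimization is applied with a Gaussian policy over coordinates, a short sketch of how the per-action importance ratio could be formed for such continuous actions is given below. The function names and the diagonal-Gaussian assumption are ours for illustration; they are not taken from the paper's code.

```python
import torch

def gaussian_log_prob(box: torch.Tensor, mu: torch.Tensor, std: torch.Tensor) -> torch.Tensor:
    """Log-density of box coordinates under a diagonal Gaussian policy,
    summed over the four coordinates."""
    dist = torch.distributions.Normal(mu, std)
    return dist.log_prob(box).sum(dim=-1)

def grpo_style_ratio(box, mu_new, std_new, mu_old, std_old):
    """Importance ratio pi_new(box) / pi_old(box), the quantity clipped in
    GRPO/PPO-style objectives, computed for continuous coordinate actions."""
    logp_new = gaussian_log_prob(box, mu_new, std_new)
    logp_old = gaussian_log_prob(box, mu_old, std_old).detach()  # old policy held fixed
    return (logp_new - logp_old).exp()
```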
Problem

Research questions and friction points this paper is trying to address.

visual chain-of-thought
region grounding
multimodal large language models
continuous actions
localization precision
Innovation

Methods, ideas, or system contributions that make the work stand out.

continuous action space
numerical visual chain-of-thought
multimodal large language models
coordinate-based reasoning
reinforcement learning
Kesen Zhao
Nanyang Technological University
Beier Zhu
Research Scientist, Nanyang Technological University
Robust Machine Learning
Junbao Zhou
Ph.D. Student
Computer Vision, 3D Vision
Xingyu Zhu
Princeton University
Zhongqi Yue
University of Science and Technology of China
Hanwang Zhang
Nanyang Technological University