Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought

📅 2026-02-27

🤖 AI Summary

Existing visual reasoning approaches are hindered by modality misalignment, semantic fragmentation, and insufficient localization accuracy. This work proposes the Numerical Visual Chain-of-Thought (NV-CoT) framework, which extends the action space of multimodal large language models from discrete vocabulary tokens to continuous Euclidean space, so that the model directly generates fine-grained bounding-box coordinates via a Gaussian (or Laplace) policy. Requiring only minimal architectural modifications, NV-CoT is compatible with both supervised fine-tuning and reinforcement learning methods such as GRPO, and uses reparameterized sampling to introduce stochasticity during optimization. Evaluated on three benchmarks against eight visual reasoning baselines, the method improves localization precision and final answer accuracy and speeds up training convergence.

📝 Abstract

Recent multimodal large language models (MLLMs) increasingly rely on visual chain-of-thought to perform region-grounded reasoning over images. However, existing approaches ground regions via either textified coordinates, which cause modality mismatch and semantic fragmentation, or fixed-granularity patches, which limit precise region selection and often require non-trivial architectural changes. In this paper, we propose Numerical Visual Chain-of-Thought (NV-CoT), a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space, allowing models to directly generate bounding-box coordinates as actions with only minimal architectural modification. The framework supports both supervised fine-tuning and reinforcement learning. In particular, we replace categorical token policies with a Gaussian (or Laplace) policy over coordinates and introduce stochasticity via reparameterized sampling, making NV-CoT fully compatible with GRPO-style policy optimization. Extensive experiments on three benchmarks against eight representative visual reasoning baselines demonstrate that NV-CoT significantly improves localization precision and final answer accuracy while also accelerating training convergence, validating the effectiveness of continuous-action visual reasoning in MLLMs. The code is available at https://github.com/kesenzhao/NV-CoT.
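
To make the Gaussian-policy idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a diagonal Gaussian over four normalized box coordinates and uses the commonly cited GRPO-style advantage (rewards standardized within a sampled group). All function names, the coordinate layout, and the example values are hypothetical and do not come from the NV-CoT repository.

```python
import math
import random


def sample_box(mean, log_std, rng):
    """Reparameterized draw: coord = mean + std * eps, with eps ~ N(0, 1)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mean]
    return [m + math.exp(s) * e for m, s, e in zip(mean, log_std, eps)]


def gaussian_log_prob(box, mean, log_std):
    """Log-density of a box under an independent (diagonal) Gaussian."""
    total = 0.0
    for x, m, s in zip(box, mean, log_std):
        z = (x - m) / math.exp(s)
        total += -0.5 * z * z - s - 0.5 * math.log(2.0 * math.pi)
    return total


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (assumed GRPO-style form)."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    return [(r - mu) / (math.sqrt(var) + eps) for r in rewards]


# Hypothetical usage: the policy head would predict `mean` and `log_std`.
rng = random.Random(0)
mean = [0.2, 0.3, 0.6, 0.8]   # assumed layout: normalized (x1, y1, x2, y2)
log_std = [-3.0] * 4          # assumed small, fixed spread for illustration
box = sample_box(mean, log_std, rng)
log_p = gaussian_log_prob(box, mean, log_std)
```

In this sketch the log-density plays the role that a token log-probability plays for a categorical policy, so a policy-gradient objective would weight `log_p` by the group-relative advantage of each sampled box. A Laplace policy would swap the sampling and density functions while leaving the rest unchanged.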

Problem

Research questions and friction points this paper is trying to address.

visual chain-of-thought

region grounding

multimodal large language models

continuous actions

localization precision

Innovation

Methods, ideas, or system contributions that make the work stand out.

continuous action space

numerical visual chain-of-thought

multimodal large language models