Perception-Aware Multimodal Spatial Reasoning from Monocular Images

📅 2026-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses fine-grained geometric perception from monocular images, where existing vision-language models struggle with precise spatial reasoning under large scale variation and ambiguous object appearance. The authors propose a perception-aware multimodal reasoning framework that represents each referred object with the set of Visual Reference Tokens (VRTs) covering its spatial extent, so that visual evidence and textual reasoning are processed jointly in a unified token space. A Multimodal Chain-of-Thought (MM-CoT) dataset injects aligned visual and textual reasoning signals, and a deterministic ordering strategy makes supervision over the inherently unordered VRT sets compatible with the VLM's autoregressive next-token prediction. With only standard supervised fine-tuning, the approach substantially outperforms prior methods on the SURDS benchmark, including those using reinforcement learning-based post-training, across both single-object and multi-object tasks.
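The summary describes two mechanisms: grounding an object as the set of VRTs covering its spatial extent, and imposing a deterministic order on that set so it can be supervised with ordinary next-token prediction. The sketch below illustrates one plausible reading of this, not the paper's actual code: the patch grid size, the image resolution, the `<VRT_i>` naming, and the choice of raster-scan order are all assumptions made for the example.

```python
# Minimal sketch (not the paper's implementation): map an object's spatial extent to a
# deterministically ordered set of Visual Reference Tokens (VRTs).
# Assumptions: the image is split into a GRID_H x GRID_W patch grid, each patch has one
# VRT id, and "deterministic ordering" is illustrated with raster-scan order.

GRID_H, GRID_W = 24, 24          # assumed patch grid of the vision encoder
IMG_H, IMG_W = 384, 384          # assumed input resolution


def vrts_for_box(box):
    """Return the VRT indices covering a bounding box (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    # Convert pixel coordinates to patch-grid coordinates.
    c1, c2 = int(x1 / IMG_W * GRID_W), min(int(x2 / IMG_W * GRID_W), GRID_W - 1)
    r1, r2 = int(y1 / IMG_H * GRID_H), min(int(y2 / IMG_H * GRID_H), GRID_H - 1)
    # Collect every patch inside the object's spatial extent (a set, hence unordered).
    return {r * GRID_W + c for r in range(r1, r2 + 1) for c in range(c1, c2 + 1)}


def order_vrts(vrts):
    """Impose a deterministic (here: raster-scan) order so the unordered VRT set can be
    supervised with standard autoregressive next-token prediction."""
    return sorted(vrts)


if __name__ == "__main__":
    box = (120.0, 60.0, 200.0, 180.0)   # hypothetical referred object
    tokens = [f"<VRT_{i}>" for i in order_vrts(vrts_for_box(box))]
    print(tokens[:5], "...", len(tokens), "tokens")
```

Any fixed, reproducible ordering would serve the same purpose; raster order is just the simplest choice to show why a canonical sequence removes the ambiguity of supervising an unordered set with a left-to-right decoder.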

📝 Abstract
Spatial reasoning from monocular images is essential for autonomous driving, yet current Vision-Language Models (VLMs) still struggle with fine-grained geometric perception, particularly under large scale variation and ambiguous object appearance. We propose a simple yet effective perception-aware multimodal reasoning framework that equips VLMs with explicit object-centric grounding ability. Instead of relying on textual bounding-box outputs, each referred object is represented using all Visual Reference Tokens (VRTs) within its spatial extent, enabling visual evidence and textual reasoning to be processed jointly in a unified token space. To further strengthen cross-modal interaction, we construct a Multimodal Chain-of-Thought (MM-CoT) dataset that injects aligned visual and textual reasoning signals. A deterministic ordering strategy is introduced to make supervision over inherently unordered VRT sets fully compatible with the VLM's autoregressive next-token prediction. With only standard supervised fine-tuning, our method achieves substantial improvements on the SURDS benchmark, outperforming previous approaches - including those using RL-based post-training - by a large margin across both single-object and multi-object tasks. These results demonstrate that accurate perception and multimodal reasoning are mutually reinforcing, and together form the key to robust spatial understanding in challenging monocular driving scenarios.
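The abstract states that the MM-CoT dataset injects aligned visual and textual reasoning signals so that VRT-based grounding and language reasoning are trained jointly under plain supervised fine-tuning. The sketch below is only an illustration of what such an interleaved training target could look like; the tag names (`<ref>`, `<VRT_i>`), the answer template, and the sequence layout are hypothetical and are not taken from the paper.

```python
# Illustrative sketch, under assumed formats: building a Multimodal Chain-of-Thought
# (MM-CoT) training target that interleaves ordered VRT tokens with textual reasoning,
# so visual evidence and language are supervised in one autoregressive token sequence.

def build_mmcot_target(question, objects, answer):
    """objects: list of (name, ordered_vrt_ids); returns one target string for SFT."""
    parts = [f"Question: {question}", "Reasoning:"]
    for name, vrt_ids in objects:
        # Visual evidence for each referred object, emitted as its ordered VRT tokens.
        vrt_str = "".join(f"<VRT_{i}>" for i in vrt_ids)
        parts.append(f"The {name} is grounded at <ref>{vrt_str}</ref>.")
    parts.append(f"Answer: {answer}")
    return " ".join(parts)


example = build_mmcot_target(
    question="Which vehicle is closer to the camera?",
    objects=[("sedan", [301, 302, 325]), ("truck", [110, 111])],
    answer="the sedan",
)
print(example)
```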
Problem

Research questions and friction points this paper is trying to address.

spatial reasoning
monocular images
geometric perception
vision-language models
autonomous driving
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Reference Tokens
Multimodal Chain-of-Thought
Object-Centric Grounding
Monocular Spatial Reasoning
Perception-Aware VLM