🤖 AI Summary
Current large vision-language models (LVLMs) suffer from limited accuracy in visual perception tasks and exhibit only marginal gains from inference-time scaling, primarily because their "fast perception" paradigm neglects the temporal nature of perception and the utility of intermediate representations. To address this, we propose Perception-Time Scaling (PTS), a paradigm that brings inference-time scaling to multimodal visual perception. PTS explicitly decomposes perception into sequential subtasks and generates token-rich intermediate representations, enabling joint optimization of perception and reasoning. Our reinforcement learning–based PTS framework achieves a substantial improvement in high-precision rate — from 8.0% to 64.7% — on our newly constructed visual estimation benchmark, DisTANCE. Furthermore, PTS generalizes well to out-of-domain tasks and exhibits stronger fine-grained attention to image tokens on diverse real-world benchmarks.
📝 Abstract
Recent advances in inference-time scaling, particularly those leveraging reinforcement learning with verifiable rewards, have substantially enhanced the reasoning capabilities of Large Vision-Language Models (LVLMs). Inspired by this success, similar strategies have been applied to multimodal reasoning, yet their impact on visual perception remains unclear. To investigate this gap, we introduce DisTANCE, a perception-centric benchmark for visual estimation tasks. Evaluation results show that LVLMs exhibit limited estimation precision, and inference-time scaling offers only marginal gains. We attribute this to the fast perception paradigm of current LVLMs, where visual understanding is treated as a one-shot output without modeling the underlying perceptual process. To address this, we propose Perception-Time Scaling (PTS), a novel paradigm that encourages token-rich perception and decomposes complex perception problems into tractable intermediate sub-problems, thereby enabling perception to align with and benefit from inference-time scaling. Combined with reinforcement learning techniques, PTS significantly improves perception accuracy, raising high-precision performance on DisTANCE from 8.0% to 64.7%, and generalizes well to out-of-domain tasks. Surprisingly, even though PTS data are purely synthetic, combining them with math reasoning data yields consistent gains on both reasoning and real-world perception benchmarks. Further analysis reveals that PTS introduces more perception-related tokens and increases the model's attention to image tokens. Our code and data will be publicly released.
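To make the contrast between one-shot "fast perception" and PTS-style decomposition concrete, here is a purely illustrative toy sketch in Python. It is *not* the paper's actual pipeline: the function names, the reference-object scaling trick, and all numeric values are hypothetical, chosen only to show how a single opaque estimate can be broken into intermediate, individually checkable sub-steps.

```python
# Toy contrast between one-shot "fast perception" and a PTS-style
# decomposition of a distance estimate. Hypothetical example only;
# not the method described in the paper.

def fast_perception(pixel_gap: float) -> float:
    """One-shot output: map a pixel gap to metres with a single opaque guess."""
    return pixel_gap * 0.01  # a fixed, unverifiable scale factor

def pts_style_estimate(pixel_gap: float,
                       ref_pixel_width: float,
                       ref_real_width_m: float) -> float:
    """Decompose the estimate into tractable intermediate sub-steps."""
    # Sub-step 1: recover a metres-per-pixel scale from a reference
    # object of known real-world size (an intermediate representation
    # that can be inspected and verified on its own).
    metres_per_pixel = ref_real_width_m / ref_pixel_width
    # Sub-step 2: apply the recovered scale to the measured pixel gap.
    return pixel_gap * metres_per_pixel

if __name__ == "__main__":
    # A 300-pixel gap, with a 0.5 m reference object spanning 100 pixels.
    print(pts_style_estimate(300.0, ref_pixel_width=100.0, ref_real_width_m=0.5))
```

The point of the sketch is that each sub-step emits an intermediate quantity that can be checked or refined, which is the property the abstract argues lets perception benefit from inference-time scaling.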