AI Summary
Existing Reinforcement Learning with Verifiable Rewards (RLVR) methods for vision-language reasoning largely overlook the fine-grained influence of visual perception on language generation and lack explicit modeling of token-level visual dependency. This work introduces the concept of *token-awareness*, quantifying the visual dependency strength of each generated token and revealing the sparsity and trajectory heterogeneity of visual utilization in multimodal reasoning. Building on this insight, we propose Visually-Perceptive Policy Optimization (VPPO): (1) a chain-of-thought-guided mechanism to measure token-level visual awareness, and (2) a dual-policy update strategy that globally reweights the advantage function by visual dependency and applies policy updates exclusively to high-awareness tokens. Evaluated across eight multimodal benchmarks, VPPO consistently outperforms leading open-source RL fine-tuned models. Its effectiveness and scalability are validated at both 7B and 32B parameter scales.
Abstract
While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.
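The dual mechanism described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's exact formulation: the function name, the use of mean dependency as the trajectory weight, and the top-fraction threshold for "perceptually pivotal" tokens are all assumptions for illustration.

```python
import numpy as np

def vppo_token_weights(dependency, advantage, top_frac=0.2):
    """Illustrative sketch of VPPO's dual mechanism (assumed details).

    dependency: per-token visual-dependency scores for one rollout.
    advantage:  scalar trajectory advantage from the verifiable reward.
    top_frac:   assumed fraction of tokens treated as pivotal.

    Returns per-token advantage weights: the trajectory advantage is
    rescaled by the rollout's overall visual dependency, and updates
    are restricted to the highest-dependency tokens.
    """
    dependency = np.asarray(dependency, dtype=float)
    # (1) Reweight the trajectory's advantage by its overall
    #     visual dependency (here: the mean token score).
    traj_weight = dependency.mean()
    # (2) Focus updates on perceptually pivotal tokens: keep only
    #     the top fraction by dependency, zeroing the rest.
    k = max(1, int(np.ceil(top_frac * len(dependency))))
    cutoff = np.sort(dependency)[-k]
    mask = (dependency >= cutoff).astype(float)
    return advantage * traj_weight * mask
```

Tokens outside the pivotal set receive zero gradient weight, matching the observation that visual dependency is sparse within a rollout, while the trajectory-level factor captures the divergence in overall dependency across rollouts.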