Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models

📅 2026-04-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses a limitation of existing reinforcement learning-based multimodal reasoning methods: they distribute advantage signals uniformly across all generated tokens, diluting the learning signal at visually critical reasoning steps. To resolve this, the authors propose the Perception-Grounded Policy Optimization (PGPO) framework, built on the novel concept of Token Visual Dependency. PGPO quantifies each token's causal information gain from the visual input via the KL divergence between visual-conditioned and text-only predictive distributions, and employs a threshold-gated, mass-conserving advantage reshaping mechanism for fine-grained credit assignment. This dynamically reweights token-level advantage signals while preserving linguistic priors. Experiments show that PGPO yields an average performance improvement of 18.7% for the Qwen2.5-VL model series across seven benchmarks, significantly reduces gradient variance, prevents training collapse, and enhances model robustness.
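The summary's central quantity, Token Visual Dependency, is a per-token KL divergence between the model's next-token distribution with and without the image in context. Below is a minimal sketch of how such a score could be computed; the function name and the assumption that paired logits are available are illustrative, not the paper's released code:

```python
# Hypothetical sketch of Token Visual Dependency (not the official PGPO code).
import torch
import torch.nn.functional as F

def token_visual_dependency(logits_with_image: torch.Tensor,
                            logits_text_only: torch.Tensor) -> torch.Tensor:
    """Per-token KL( p(y_t | image, text) || p(y_t | text only) ).

    Both inputs are next-token logits of shape (seq_len, vocab_size),
    produced by the same model with and without the visual input.
    Returns a (seq_len,) tensor of dependency scores.
    """
    log_p_vis = F.log_softmax(logits_with_image, dim=-1)
    log_p_txt = F.log_softmax(logits_text_only, dim=-1)
    # KL(p_vis || p_txt), summed over the vocabulary at each position.
    return (log_p_vis.exp() * (log_p_vis - log_p_txt)).sum(dim=-1)
```

Tokens whose prediction barely changes when the image is removed (e.g. function words carried by linguistic priors) score near zero, which matches the sparsity the abstract reports.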
📝 Abstract
While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually-grounded steps of multimodal reasoning. To bridge this gap, we formulate Token Visual Dependency, quantifying the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions. Revealing that this dependency is highly sparse and semantically pivotal, we introduce Perception-Grounded Policy Optimization (PGPO), a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold-gated, mass-conserving mechanism, PGPO actively amplifies learning signals for visually-dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments on the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks demonstrate that PGPO boosts models by 18.7% on average. Both theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning. Code will be published at https://github.com/Yzk1114/PGPO.
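To make the "threshold-gated, mass-conserving" reshaping concrete, here is a hypothetical sketch in which a single sequence-level advantage (as in typical RLVR setups) is redistributed over tokens using the dependency scores above. The threshold `tau`, the weighting rule, and the renormalization are assumptions for illustration; the paper's exact mechanism may differ:

```python
# Hypothetical sketch of threshold-gated, mass-conserving advantage reshaping.
import torch

def reshape_advantages(advantage: float,
                       dependency: torch.Tensor,
                       tau: float = 0.1) -> torch.Tensor:
    """Redistribute one sequence-level advantage across tokens.

    Tokens whose visual dependency exceeds the threshold `tau` are
    amplified; the rest keep a baseline weight. Rescaling by the mean
    weight keeps the total advantage mass equal to the uniform baseline.
    """
    base = torch.full_like(dependency, advantage)
    gate = (dependency > tau).float()   # threshold gating
    weights = 1.0 + gate * dependency   # amplify visually-dependent tokens
    weights = weights / weights.mean()  # mass conservation
    return base * weights
```

Normalizing by the mean weight is one simple way to satisfy mass conservation: the summed advantage over the sequence equals what uniform assignment would give, so only the per-token allocation changes.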
Problem

Research questions and friction points this paper is trying to address.

Large Vision-Language Models
Reinforcement Learning
Token-level Credit Assignment
Visual Dependency
Multimodal Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Perception-Grounded Policy Optimization
Token Visual Dependency
Multimodal Reasoning
Credit Assignment
Reinforcement Learning from Verifiable Rewards
Authors
Zekai Ye
Harbin Institute of Technology
Qiming Li
Harbin Institute of Technology
Xiaocheng Feng
Harbin Institute of Technology
NLP, Deep Learning, Machine Learning
Ruihan Chen
Harbin Institute of Technology
Ziming Li
Huawei Technologies Co., Ltd
Haoyu Ren
Huawei Technologies Co., Ltd
Kun Chen
Huawei Technologies Co., Ltd
Dandan Tu
Huawei Technologies Co., Ltd
Bing Qin
Professor at Harbin Institute of Technology
Natural Language Processing, Information Extraction, Sentiment Analysis