Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing reinforcement learning approaches to multimodal chain-of-thought reasoning: they rely on coarse-grained optimization and fail to differentiate reasoning steps by their degree of visual grounding. To overcome this, the authors propose Perception-Exploration Policy Optimization (PEPO), which introduces, for the first time, token-level perceptual priors and a smooth gating mechanism to enable fine-grained policy updates without requiring additional supervision or auxiliary branches. PEPO constructs perceptual priors from hidden-state similarities and combines them with token entropy to derive token-level advantage functions, making it compatible with mainstream RLVR frameworks such as GRPO and DAPO. Extensive experiments across multiple benchmarks, including geometric reasoning, visual grounding, visual puzzles, and few-shot classification, show that PEPO consistently yields significant performance gains while maintaining more stable training dynamics.

📝 Abstract
Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories that interleave perceptual grounding with multi-step inference. However, existing Reinforcement Learning with Verifiable Rewards (RLVR) methods typically optimize reasoning at a coarse granularity, treating CoT tokens uniformly without distinguishing their varying degrees of visual grounding. In this work, we conduct a token-level analysis of multimodal reasoning trajectories and show that successful reasoning is characterized by structured token dynamics reflecting both perceptual grounding and exploratory inference. Building upon this analysis, we propose Perception-Exploration Policy Optimization (PEPO), which derives a perception prior from hidden-state similarity and integrates it with token entropy through a smooth gating mechanism to produce token-level advantages. PEPO integrates seamlessly with existing RLVR frameworks such as GRPO and DAPO, requiring neither additional supervision nor auxiliary branches. Extensive experiments across diverse multimodal benchmarks demonstrate consistent and robust improvements over strong RL baselines, spanning geometry reasoning, visual grounding, visual puzzle solving, and few-shot classification, while maintaining stable training dynamics. Code: https://github.com/xzxxntxdy/PEPO
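The abstract describes the core mechanism only at a high level: a perception prior from hidden-state similarity is fused with token entropy through a smooth gate to turn a sequence-level RLVR advantage into token-level advantages. The sketch below illustrates one plausible reading of that recipe in NumPy. It is not the paper's implementation: the function name, the use of cosine similarity against a pooled image embedding, the sigmoid gate, and the weights `alpha`/`beta` are all illustrative assumptions.

```python
import numpy as np

def token_level_advantages(hidden_states, image_embedding, logits,
                           sequence_advantage, alpha=1.0, beta=1.0):
    """Hedged sketch of a PEPO-style token-level advantage.

    hidden_states:      (T, D) per-token hidden states
    image_embedding:    (D,)   pooled visual embedding (assumed proxy)
    logits:             (T, V) policy next-token logits
    sequence_advantage: scalar GRPO/DAPO-style group advantage
    """
    # Perception prior: cosine similarity between each token's hidden
    # state and the pooled image embedding (proxy for visual grounding).
    h = hidden_states / np.linalg.norm(hidden_states, axis=-1, keepdims=True)
    v = image_embedding / np.linalg.norm(image_embedding)
    perception = h @ v                                        # (T,)

    # Token entropy of the policy distribution (proxy for exploration).
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-9)).sum(axis=-1)    # (T,)

    # Smooth gate: a sigmoid blends the two signals into a per-token
    # weight in (0, 1), which rescales the shared sequence advantage.
    gate = 1.0 / (1.0 + np.exp(-(alpha * perception + beta * entropy)))
    return sequence_advantage * gate                          # (T,)
```

Because the gate is bounded in (0, 1), each token's advantage keeps the sign of the verifiable sequence reward while its magnitude is modulated per token, which is consistent with the claimed drop-in compatibility with GRPO- and DAPO-style updates.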
Problem

Research questions and friction points this paper is trying to address.

Multimodal Chain-of-Thought
token-level optimization
perceptual grounding
Reinforcement Learning with Verifiable Rewards
vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-Level Policy Optimization
Multimodal Chain-of-Thought
Perception-Exploration Policy Optimization
Perceptual Grounding
Reinforcement Learning with Verifiable Rewards