Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses low tool utilization and high tool-call failure rates in multimodal agents, which the authors attribute to the "Thinking-Acting Gap": a structural asymmetry between thinking (the self-contained default) and tool use (a high-variance auxiliary behavior). To bridge this gap, the authors propose AXPO (Agent eXplorative Policy Optimization). When every tool-using rollout in a group is wrong, AXPO holds the thinking prefix fixed and resamples the tool call and its continuation, with the prefix chosen by an uncertainty-based selection mechanism. This strengthens the training signal at the tool-call decision points where it would otherwise be suppressed. Combining supervised fine-tuning with AXPO-based reinforcement learning on Qwen3-VL-Thinking improves on SFT+GRPO by an average of 1.8 percentage points in both Pass@1 and Pass@4 at the 8B scale across nine benchmarks. Notably, the 8B model with SFT+AXPO surpasses the 32B base model on Pass@4 with one-quarter of the parameters.
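For readers unfamiliar with the Pass@1 and Pass@4 metrics quoted above, the following is a minimal sketch of the commonly used unbiased pass@k estimator. Whether the paper computes Pass@k this way is an assumption; the function name `pass_at_k` and its parameters are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for a question
    c: number of those samples that were correct
    k: budget of samples being scored (k <= n)
    """
    if k > n:
        raise ValueError("k must not exceed n")
    # If fewer than k samples are wrong, any k-subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the chance that a random k-subset contains only wrong samples.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With four samples of which one is correct, `pass_at_k(4, 1, 1)` gives 0.25 and `pass_at_k(4, 1, 4)` gives 1.0, which illustrates why Pass@4 rewards a model that occasionally finds a working tool call even when most attempts fail.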

📝 Abstract

Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools that internal reasoning alone often cannot resolve. Agentic reasoning therefore interleaves two behaviors with a structural asymmetry: thinking (the self-contained default) and tool use (a high-variance auxiliary action). We refer to this asymmetry as the Thinking-Acting Gap. Under standard RL recipes like GRPO, the gap manifests as two diagnostic symptoms during training: tool use is attempted on only ~30% of rollouts, and when it is attempted, the tool-using rollouts within a group are all wrong on ~40% of questions, suppressing the learning signal at the tool calls that need it most. We propose AXPO (Agent eXplorative Policy Optimization): for each all-wrong tool-using subgroup, AXPO fixes the thinking prefix and resamples the tool call and its continuation, paired with uncertainty-based prefix selection. Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO on average (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B), and the 8B model with SFT+AXPO surpasses the 32B Base on Pass@4 with 4 times fewer parameters.
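The resampling step described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Rollout` fields, the `prefix_uncertainty` score, the choice to pick the most uncertain prefix, and the `sample_fn` callback are all hypothetical. The advantage function is the standard group-normalized form used by GRPO.

```python
import math
from dataclasses import dataclass


@dataclass
class Rollout:
    prefix: str          # thinking text before the tool call
    continuation: str    # tool call plus everything after it
    used_tool: bool
    reward: float        # 1.0 if the final answer is correct, else 0.0
    prefix_uncertainty: float = 0.0  # hypothetical score, e.g. mean token entropy


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def all_wrong_tool_subgroup(group):
    """Return the tool-using rollouts if every one of them failed, else []."""
    tool_rollouts = [r for r in group if r.used_tool]
    if tool_rollouts and all(r.reward == 0.0 for r in tool_rollouts):
        return tool_rollouts
    return []


def select_prefix(subgroup):
    """Assumption: anchor on the rollout whose thinking prefix is most uncertain."""
    return max(subgroup, key=lambda r: r.prefix_uncertainty)


def resample_from_prefix(anchor, sample_fn, k):
    """Keep the thinking prefix fixed; draw k new tool calls and continuations.

    sample_fn is a hypothetical callback that takes a prefix string and
    returns a new Rollout starting from that prefix.
    """
    return [sample_fn(anchor.prefix) for _ in range(k)]
```

Note that `group_advantages([0.0, 0.0, 0.0, 0.0])` returns all zeros: when no tool-using rollout succeeds, the group offers no positive example of a tool call for the policy to reinforce. Resampling from a fixed prefix concentrates extra exploration at exactly that decision point.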

Problem

Research questions and friction points this paper is trying to address.

Thinking-Acting Gap

multimodal agentic reasoning

tool use

reinforcement learning

vision-language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

AXPO

Thinking-Acting Gap

tool use