Towards Efficient Online Tuning of VLM Agents via Counterfactual Soft Reinforcement Learning

📅 2025-05-01

📈 Citations: 0

✨ Influential: 0

career value

209K/year

🤖 AI Summary

To address the low exploration efficiency and slow policy convergence of vision-language model (VLM) agents in online reinforcement learning fine-tuning—caused by their open-ended textual action space—this paper proposes a counterfactual reasoning–based soft reinforcement learning framework. Our core innovation is the first-ever counterfactual token importance scoring mechanism, which dynamically identifies critical action subsequences via token-level causal effect modeling, thereby compressing the exploration space while preserving semantic integrity. We provide theoretical guarantees on monotonic policy improvement and convergence. The framework integrates soft policy optimization with online rollout uncertainty calibration. Evaluated across diverse tasks—including Android UI control, card games, and embodied AI—we achieve a 23.6% average improvement in task completion rate and reduce required training samples by 41%.

Technology Category

Application Category

📝 Abstract

Online fine-tuning vision-language model (VLM) agents with reinforcement learning (RL) has shown promise for equipping agents with multi-step, goal-oriented capabilities in dynamic environments. However, their open-ended textual action space and non-end-to-end nature of action generation present significant challenges to effective online exploration in RL, e.g., explosion of the exploration space. We propose a novel online fine-tuning method, Counterfactual Soft Reinforcement Learning (CoSo), better suited to the textual output space of VLM agents. Compared to prior methods that assign uniform uncertainty to all tokens, CoSo leverages counterfactual reasoning to dynamically assess the causal influence of individual tokens on post-processed actions. By prioritizing the exploration of action-critical tokens while reducing the impact of semantically redundant or low-impact tokens, CoSo enables a more targeted and efficient online rollout process. We provide theoretical analysis proving CoSo's convergence and policy improvement guarantees, and extensive empirical evaluations supporting CoSo's effectiveness. Our results across a diverse set of agent tasks, including Android device control, card gaming, and embodied AI, highlight its remarkable ability to enhance exploration efficiency and deliver consistent performance gains. The code is available at https://github.com/langfengQ/CoSo.

Problem

Research questions and friction points this paper is trying to address.

Addresses inefficient online exploration in VLM agents' textual action space

Reduces impact of redundant tokens in reinforcement learning fine-tuning

Improves exploration efficiency for multi-step goal-oriented agent tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses counterfactual reasoning for token influence assessment

Prioritizes exploration of action-critical tokens

Enhances efficiency in online rollout process

🔎 Similar Papers

Balance Reward and Safety Optimization for Safe Reinforcement Learning: A Perspective of Gradient Manipulation