Towards Efficient Online Tuning of VLM Agents via Counterfactual Soft Reinforcement Learning

📅 2025-05-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the low exploration efficiency and slow policy convergence of vision-language model (VLM) agents in online reinforcement learning fine-tuning—caused by their open-ended textual action space—this paper proposes a counterfactual reasoning–based soft reinforcement learning framework. Our core innovation is the first-ever counterfactual token importance scoring mechanism, which dynamically identifies critical action subsequences via token-level causal effect modeling, thereby compressing the exploration space while preserving semantic integrity. We provide theoretical guarantees on monotonic policy improvement and convergence. The framework integrates soft policy optimization with online rollout uncertainty calibration. Evaluated across diverse tasks—including Android UI control, card games, and embodied AI—we achieve a 23.6% average improvement in task completion rate and reduce required training samples by 41%.

Technology Category

Application Category

📝 Abstract
Online fine-tuning vision-language model (VLM) agents with reinforcement learning (RL) has shown promise for equipping agents with multi-step, goal-oriented capabilities in dynamic environments. However, their open-ended textual action space and non-end-to-end nature of action generation present significant challenges to effective online exploration in RL, e.g., explosion of the exploration space. We propose a novel online fine-tuning method, Counterfactual Soft Reinforcement Learning (CoSo), better suited to the textual output space of VLM agents. Compared to prior methods that assign uniform uncertainty to all tokens, CoSo leverages counterfactual reasoning to dynamically assess the causal influence of individual tokens on post-processed actions. By prioritizing the exploration of action-critical tokens while reducing the impact of semantically redundant or low-impact tokens, CoSo enables a more targeted and efficient online rollout process. We provide theoretical analysis proving CoSo's convergence and policy improvement guarantees, and extensive empirical evaluations supporting CoSo's effectiveness. Our results across a diverse set of agent tasks, including Android device control, card gaming, and embodied AI, highlight its remarkable ability to enhance exploration efficiency and deliver consistent performance gains. The code is available at https://github.com/langfengQ/CoSo.
Problem

Research questions and friction points this paper is trying to address.

Addresses inefficient online exploration in VLM agents' textual action space
Reduces impact of redundant tokens in reinforcement learning fine-tuning
Improves exploration efficiency for multi-step goal-oriented agent tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses counterfactual reasoning for token influence assessment
Prioritizes exploration of action-critical tokens
Enhances efficiency in online rollout process
🔎 Similar Papers
No similar papers found.
Lang Feng
Lang Feng
Nanyang Technological University
Reinforcement Learning
W
Weihao Tan
Nanyang Technological University
Zhiyi Lyu
Zhiyi Lyu
Nanyang Technological University
Large language modelLLM Alignment
Longtao Zheng
Longtao Zheng
PhD student, NTU Singapore
artificial intelligenceagentsreinforcement learning
H
Haiyang Xu
Alibaba Group
M
Ming Yan
Alibaba Group
F
Fei Huang
Alibaba Group
B
Bo An
Nanyang Technological University, Skywork AI