COVR: Collaborative Optimization of VLMs and RL Agent for Visual-Based Control

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the low sample efficiency of visual reinforcement learning due to high-dimensional observations and the underutilization of reinforcement learning (RL) interaction data for enhancing vision-language models (VLMs). The authors propose COVR, a framework that enables the first bidirectional co-optimization between VLMs and RL agents. On one hand, RL-generated data is leveraged to fine-tune the VLM through exploration-driven dynamic filtering and reward-aware adaptive loss weighting, thereby improving its task-relevant semantic reasoning. On the other hand, the enhanced VLM provides action priors to guide policy learning, with progressive fine-tuning employed to reduce computational overhead. Experiments demonstrate that COVR significantly improves both performance and generalization across a range of complex visual control tasks.

📝 Abstract
Visual reinforcement learning (RL) suffers from poor sample efficiency due to high-dimensional observations in complex tasks. While existing works have shown that vision-language models (VLMs) can assist RL, they often focus on knowledge distillation from the VLM to RL, overlooking the potential of RL-generated interaction data to enhance the VLM. To address this, we propose COVR, a collaborative optimization framework that enables the mutual enhancement of the VLM and RL policies. Specifically, COVR fine-tunes the VLM with RL-generated data to enhance the semantic reasoning ability consistent with the target task, and uses the enhanced VLM to further guide policy learning via action priors. To improve fine-tuning efficiency, we introduce two key modules: (1) an Exploration-Driven Dynamic Filter module that preserves valuable exploration samples using adaptive thresholds based on the degree of exploration, and (2) a Return-Aware Adaptive Loss Weight module that improves the stability of training by quantifying the inconsistency of sampled actions via return signals of RL. We further design a progressive fine-tuning strategy to reduce resource consumption. Extensive experiments show that COVR achieves strong performance across various challenging visual control tasks.
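The two fine-tuning modules described above can be sketched in code. The paper's exact threshold rule and weighting scheme are not given in this summary, so the percentile-based filter and softmax return weighting below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def dynamic_filter(samples, exploration_scores, percentile=70.0):
    """Exploration-Driven Dynamic Filter (sketch).

    Keeps samples whose exploration score clears an adaptive threshold.
    Here the threshold adapts to the current batch as a percentile of
    its exploration scores -- a hypothetical choice standing in for the
    paper's unspecified rule.
    """
    scores = np.asarray(exploration_scores, dtype=np.float64)
    threshold = np.percentile(scores, percentile)
    return [s for s, score in zip(samples, scores) if score >= threshold]

def return_aware_weights(returns, temperature=1.0):
    """Return-Aware Adaptive Loss Weight (sketch).

    Maps episode returns to per-sample loss weights via a softmax, so
    that high-return (more task-consistent) actions contribute more to
    the VLM fine-tuning loss and low-return, inconsistent actions are
    down-weighted.
    """
    r = np.asarray(returns, dtype=np.float64)
    z = (r - r.max()) / temperature  # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()
```

In a co-optimization loop of this kind, RL rollouts would first pass through the filter, and the surviving (observation, action, return) tuples would then be weighted before being used as supervision for the VLM.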
Problem

Research questions and friction points this paper is trying to address.

visual reinforcement learning
vision-language models
sample efficiency
mutual enhancement
high-dimensional observations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Collaborative Optimization
Vision-Language Models
Visual Reinforcement Learning
Action Priors
Adaptive Fine-tuning
Canming Xia
School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, China
Peixi Peng
School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, China
Guang Tan
School of Intelligent Systems Engineering, Sun Yat-sen University
Machine Learning · Mobile Computing · Networking
Zhan Su
University of Montreal; MILA
PEFT approaches · LLMs · Information retrieval
Haoran Xu
School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, China
Zhenxian Liu
National Engineering Research Center of Visual Technology, School of Computer Science, Peking University, China
Luntong Li
Peng Cheng Laboratory, China