🤖 AI Summary
This work addresses the low sample efficiency of visual reinforcement learning under high-dimensional observations, and the underutilization of reinforcement learning (RL) interaction data for enhancing vision-language models (VLMs). The authors propose COVR, a framework that enables the first bidirectional co-optimization between VLMs and RL agents. On one hand, RL-generated data is used to fine-tune the VLM through exploration-driven dynamic filtering and return-aware adaptive loss weighting, improving its task-relevant semantic reasoning. On the other hand, the enhanced VLM provides action priors to guide policy learning, with a progressive fine-tuning strategy employed to reduce computational overhead. Experiments show that COVR significantly improves both performance and generalization across a range of complex visual control tasks.
📝 Abstract
Visual reinforcement learning (RL) suffers from poor sample efficiency due to high-dimensional observations in complex tasks. While existing works have shown that vision-language models (VLMs) can assist RL, they typically focus on one-way knowledge distillation from the VLM to the RL agent, overlooking the potential of RL-generated interaction data to enhance the VLM. To address this, we propose COVR, a collaborative optimization framework that enables mutual enhancement between the VLM and the RL policy. Specifically, COVR fine-tunes the VLM on RL-generated data to strengthen semantic reasoning aligned with the target task, and uses the enhanced VLM to further guide policy learning via action priors. To improve fine-tuning efficiency, we introduce two key modules: (1) an Exploration-Driven Dynamic Filter module that preserves valuable exploration samples using adaptive thresholds based on the degree of exploration, and (2) a Return-Aware Adaptive Loss Weight module that improves training stability by quantifying the inconsistency of sampled actions via RL return signals. We further design a progressive fine-tuning strategy to reduce resource consumption. Extensive experiments show that COVR achieves strong performance across various challenging visual control tasks.
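The two fine-tuning modules described above can be sketched roughly as follows. This is a minimal, hypothetical illustration based only on the abstract: the function names, the novelty signal used as the "degree of exploration", the mean-based adaptive threshold, and the softmax-over-returns weighting are all assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of COVR's two VLM fine-tuning modules, inferred from
# the abstract. All names, signals, and formulas below are assumptions.
import math


def exploration_driven_filter(samples, base_threshold=0.5):
    """Keep exploration samples whose novelty exceeds an adaptive threshold.

    Each sample is a dict with a "novelty" score standing in for the agent's
    degree of exploration (e.g., policy entropy or state-visitation rarity).
    The threshold adapts to the batch's mean novelty, so more samples are
    preserved when the agent is exploring broadly.
    """
    mean_novelty = sum(s["novelty"] for s in samples) / len(samples)
    threshold = base_threshold * mean_novelty
    return [s for s in samples if s["novelty"] >= threshold]


def return_aware_loss_weights(samples, temperature=1.0):
    """Compute per-sample loss weights from RL return signals.

    Normalizes episode returns to [0, 1] and applies a softmax, so samples
    whose actions are inconsistent with high returns contribute less to the
    VLM fine-tuning loss.
    """
    returns = [s["ret"] for s in samples]
    lo, hi = min(returns), max(returns)
    span = (hi - lo) or 1.0  # avoid division by zero on constant returns
    normed = [(r - lo) / span for r in returns]
    exps = [math.exp(n / temperature) for n in normed]
    total = sum(exps)
    return [e / total for e in exps]
```

In this sketch, the filtered samples would feed the VLM fine-tuning step, with each sample's loss term scaled by its return-derived weight; the actual thresholding and weighting schemes in COVR may differ.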