MA-VLCM: A Vision Language Critic Model for Value Estimation of Policies in Multi-Agent Team Settings

📅 2026-03-16

🤖 AI Summary
This work addresses two obstacles in multi-agent reinforcement learning (MARL). Centralized critics learned from scratch are sample-inefficient and generalize poorly across environments, while large vision-language-action models are too computationally expensive to deploy directly on resource-constrained, heterogeneous robot teams. The authors propose MA-VLCM, a framework that replaces the learned centralized critic with a pretrained vision-language model (VLM) fine-tuned to evaluate multi-agent behavior. The critic is conditioned on natural language task descriptions, visual trajectory observations, and structured multi-agent state information. Because no critic is learned during policy optimization, the approach is reported to improve sample efficiency while producing compact policies suitable for execution on the robots themselves. The reported results show good zero-shot return estimation across different VLM backbones, in both in-distribution and out-of-distribution multi-agent team scenarios.

📝 Abstract
Multi-agent reinforcement learning (MARL) commonly relies on a centralized critic to estimate the value function. However, learning such a critic from scratch is highly sample-inefficient and often lacks generalization across environments. At the same time, large vision-language-action models (VLAs) trained on internet-scale data exhibit strong multimodal reasoning and zero-shot generalization capabilities, yet directly deploying them for robotic execution remains computationally prohibitive, particularly in heterogeneous multi-robot systems with diverse embodiments and resource constraints. To address these challenges, we propose Multi-Agent Vision-Language-Critic Models (MA-VLCM), a framework that replaces the learned centralized critic in MARL with a pretrained vision-language model fine-tuned to evaluate multi-agent behavior. MA-VLCM acts as a centralized critic conditioned on natural language task descriptions, visual trajectory observations, and structured multi-agent state information. By eliminating critic learning during policy optimization, our approach significantly improves sample efficiency while producing compact execution policies suitable for deployment on resource-constrained robots. Results show good zero-shot return estimation for models with differing VLM backbones, on both in-distribution and out-of-distribution scenarios in multi-agent team settings.
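
The idea of optimizing policies against a fixed critic, with no critic update step, can be illustrated with a minimal sketch. The abstract does not say how the critic's values enter the policy update, so the generalized advantage estimation (GAE) used here is an assumption, not the authors' method. `TeamStep`, `stub_critic`, and `advantages_from_frozen_critic` are hypothetical names, and the stub replaces the VLM with a toy heuristic so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class TeamStep:
    """One joint timestep of a multi-agent episode (hypothetical container)."""
    frame: object                    # placeholder for a rendered image of the scene
    agent_states: List[List[float]]  # one state vector per agent
    team_reward: float               # shared team reward at this step


# A critic maps (task description, trajectory so far) to a scalar value estimate.
Critic = Callable[[str, Sequence[TeamStep]], float]


def stub_critic(task: str, history: Sequence[TeamStep]) -> float:
    """Toy stand-in for a VLM critic; a real one would query a model here.

    It scores the latest step as the negative mean of |first state coordinate|
    over agents, purely so that the example runs without any model.
    """
    states = history[-1].agent_states
    return -sum(abs(s[0]) for s in states) / len(states)


def advantages_from_frozen_critic(task, steps, critic, gamma=0.99, lam=0.95):
    """GAE-style advantages using values from a fixed critic (no critic update)."""
    values = [critic(task, steps[: t + 1]) for t in range(len(steps))]
    values.append(0.0)  # assumption: the episode terminates after the last step
    advantages = [0.0] * len(steps)
    running = 0.0
    for t in reversed(range(len(steps))):
        # One-step TD error computed from the frozen critic's estimates.
        delta = steps[t].team_reward + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


episode = [
    TeamStep(frame=None, agent_states=[[1.0], [3.0]], team_reward=0.0),
    TeamStep(frame=None, agent_states=[[0.5], [1.5]], team_reward=1.0),
]
print(advantages_from_frozen_critic("meet at the origin", episode, stub_critic))
```

In a training loop of this shape, the advantages would weight a policy-gradient loss for the per-agent policies, and only those policies would receive gradient updates. That is one way to read "eliminating critic learning during policy optimization".
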
Problem

Research questions and friction points this paper is trying to address.

multi-agent reinforcement learning
centralized critic
sample efficiency
zero-shot generalization
vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language model
multi-agent reinforcement learning
centralized critic
zero-shot generalization
sample efficiency