CoMaTrack: Competitive Multi-Agent Game-Theoretic Tracking with Vision-Language-Action Models

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing embodied visual tracking methods rely on single-agent imitation learning, which suffers from limited generalization due to dependence on costly expert demonstrations and static environments. This work proposes the first competitive multi-agent game-theoretic framework, where agents enhance adaptive planning and robustness through subtask-level competition in dynamic adversarial settings. We introduce a new benchmark, CoMaTrack-Bench, and integrate a vision-language-action model with a game-theory-driven multi-agent reinforcement learning paradigm. Evaluated on EVT-Bench, our trained 3B-parameter model outperforms prior 7B-parameter single-agent approaches, achieving state-of-the-art results with 92.1% success rate (STT), 74.2% dynamic tracking accuracy (DT), and 57.5% action trajectory fidelity (AT), while also setting a new performance frontier on CoMaTrack-Bench.

📝 Abstract
Embodied Visual Tracking (EVT), a core dynamic task in embodied intelligence, requires an agent to precisely follow a language-specified target. Yet most existing methods rely on single-agent imitation learning, suffering from costly expert data and limited generalization due to static training environments. Inspired by competition-driven capability evolution, we propose CoMaTrack, a competitive game-theoretic multi-agent reinforcement learning framework that trains agents in a dynamic adversarial setting with competitive subtasks, yielding stronger adaptive planning and interference-resilient strategies. We further introduce CoMaTrack-Bench, the first benchmark for competitive EVT, featuring game scenarios between a tracker and adaptive opponents across diverse environments and instructions, enabling standardized robustness evaluation under active adversarial interactions. Experiments show that CoMaTrack achieves state-of-the-art results on both standard benchmarks and CoMaTrack-Bench. Notably, a 3B VLM trained with our framework surpasses previous single-agent imitation learning methods based on 7B models on the challenging EVT-Bench, achieving 92.1% in STT, 74.2% in DT, and 57.5% in AT. The benchmark code will be available at https://github.com/wlqcode/CoMaTrack-Bench.
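The abstract's core idea, training a tracker against an adaptive opponent rather than on static demonstrations, can be illustrated with a toy competitive self-play loop. The sketch below is a minimal assumption-laden stand-in (a 1-D pursuit world with tabular Q-learning and zero-sum rewards), not the paper's actual CoMaTrack architecture, VLA model, or reward design:

```python
import random
from collections import defaultdict

# Toy competitive self-play sketch. All details here (1-D world, relative-
# distance state, reward shaping, Q-learning) are illustrative assumptions,
# not the CoMaTrack method described in the paper.

N = 10               # world size (assumption)
ACTIONS = [-1, 0, 1]  # move left, stay, move right


def step(pos, a):
    """Move within the 1-D world, clipping at the boundaries."""
    return max(0, min(N - 1, pos + a))


def train(episodes=2000, eps=0.2, alpha=0.5, gamma=0.9, seed=0):
    """Jointly train a tracker and an adversarial target via self-play."""
    rng = random.Random(seed)
    Qt = defaultdict(float)  # tracker Q: (relative_distance, action) -> value
    Qo = defaultdict(float)  # opponent (target) Q; receives the negated reward

    def pick(Q, s):
        if rng.random() < eps:            # epsilon-greedy exploration
            return rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        t, o = rng.randrange(N), rng.randrange(N)
        for _ in range(20):
            s = o - t                      # state: signed relative distance
            at, ao = pick(Qt, s), pick(Qo, s)
            t, o = step(t, at), step(o, ao)
            s2 = o - t
            r = 1.0 if abs(s2) <= 1 else -0.1  # tracker reward; target gets -r
            for Q, a, sign in ((Qt, at, 1.0), (Qo, ao, -1.0)):
                best = max(Q[(s2, b)] for b in ACTIONS)
                Q[(s, a)] += alpha * (sign * r + gamma * best - Q[(s, a)])
    return Qt, Qo


def success_rate(Qt, Qo, trials=200, seed=1):
    """Greedy rollouts: fraction of episodes ending with the target in range."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        t, o = rng.randrange(N), rng.randrange(N)
        for _ in range(20):
            t = step(t, max(ACTIONS, key=lambda a: Qt[(o - t, a)]))
            o = step(o, max(ACTIONS, key=lambda a: Qo[(o - t, a)]))
        hits += abs(o - t) <= 1
    return hits / trials
```

The key structural point mirrored from the abstract is that the opponent is trained too (here via the negated reward), so the tracker's curriculum hardens as the adversary improves, instead of being fixed by a static dataset.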
Problem

Research questions and friction points this paper is trying to address.

Embodied Visual Tracking
imitation learning
expert data
generalization
static environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

competitive multi-agent reinforcement learning
game-theoretic tracking
embodied visual tracking
vision-language-action models
adversarial benchmark
Youzhi Liu
Amap, Alibaba Group
Li Gao
Amap, Alibaba Group
Liu Liu
Amap, Alibaba Group
Mingyang Lv
Amap, Alibaba Group
Yang Cai
Professor of Computer Science and Economics, Yale University
Theoretical Computer Science · Algorithmic Game Theory · Mechanism Design · Learning