CoMaTrack: Competitive Multi-Agent Game-Theoretic Tracking with Vision-Language-Action Models

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing embodied visual tracking methods rely on single-agent imitation learning, which suffers from limited generalization due to dependence on costly expert demonstrations and static environments. This work proposes the first competitive multi-agent game-theoretic framework, where agents enhance adaptive planning and robustness through subtask-level competition in dynamic adversarial settings. We introduce a new benchmark, CoMaTrack-Bench, and integrate a vision-language-action model with a game-theory-driven multi-agent reinforcement learning paradigm. Evaluated on EVT-Bench, our trained 3B-parameter model outperforms prior 7B-parameter single-agent approaches, achieving state-of-the-art results with 92.1% success rate (STT), 74.2% dynamic tracking accuracy (DT), and 57.5% action trajectory fidelity (AT), while also setting a new performance frontier on CoMaTrack-Bench.

📝 Abstract
Embodied Visual Tracking (EVT), a core dynamic task in embodied intelligence, requires an agent to precisely follow a language-specified target. Yet most existing methods rely on single-agent imitation learning, suffering from costly expert data and limited generalization due to static training environments. Inspired by competition-driven capability evolution, we propose CoMaTrack, a competitive game-theoretic multi-agent reinforcement learning framework that trains agents in a dynamic adversarial setting with competitive subtasks, yielding stronger adaptive planning and interference-resilient strategies. We further introduce CoMaTrack-Bench, the first benchmark for competitive EVT, featuring game scenarios between a tracker and adaptive opponents across diverse environments and instructions, enabling standardized robustness evaluation under active adversarial interactions. Experiments show that CoMaTrack achieves state-of-the-art results on both standard benchmarks and CoMaTrack-Bench. Notably, a 3B VLM trained with our framework surpasses previous single-agent imitation learning methods based on 7B models on the challenging EVT-Bench, achieving 92.1% in STT, 74.2% in DT, and 57.5% in AT. The benchmark code will be available at https://github.com/wlqcode/CoMaTrack-Bench.
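The abstract's core idea, training a tracker against an adaptive opponent rather than on static demonstrations, can be illustrated with a toy competitive self-play loop. The sketch below is a minimal assumption-laden stand-in (a 1-D pursuit world with tabular Q-learning and zero-sum rewards), not the paper's actual CoMaTrack architecture, VLA model, or reward design:

```python
import random
from collections import defaultdict

# Toy competitive self-play sketch. All details here (1-D world, relative-
# distance state, reward shaping, Q-learning) are illustrative assumptions,
# not the CoMaTrack method described in the paper.

N = 10               # world size (assumption)
ACTIONS = [-1, 0, 1]  # move left, stay, move right


def step(pos, a):
    """Move within the 1-D world, clipping at the boundaries."""
    return max(0, min(N - 1, pos + a))


def train(episodes=2000, eps=0.2, alpha=0.5, gamma=0.9, seed=0):
    """Jointly train a tracker and an adversarial target via self-play."""
    rng = random.Random(seed)
    Qt = defaultdict(float)  # tracker Q: (relative_distance, action) -> value
    Qo = defaultdict(float)  # opponent (target) Q; receives the negated reward

    def pick(Q, s):
        if rng.random() < eps:            # epsilon-greedy exploration
            return rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        t, o = rng.randrange(N), rng.randrange(N)
        for _ in range(20):
            s = o - t                      # state: signed relative distance
            at, ao = pick(Qt, s), pick(Qo, s)
            t, o = step(t, at), step(o, ao)
            s2 = o - t
            r = 1.0 if abs(s2) <= 1 else -0.1  # tracker reward; target gets -r
            for Q, a, sign in ((Qt, at, 1.0), (Qo, ao, -1.0)):
                best = max(Q[(s2, b)] for b in ACTIONS)
                Q[(s, a)] += alpha * (sign * r + gamma * best - Q[(s, a)])
    return Qt, Qo


def success_rate(Qt, Qo, trials=200, seed=1):
    """Greedy rollouts: fraction of episodes ending with the target in range."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        t, o = rng.randrange(N), rng.randrange(N)
        for _ in range(20):
            t = step(t, max(ACTIONS, key=lambda a: Qt[(o - t, a)]))
            o = step(o, max(ACTIONS, key=lambda a: Qo[(o - t, a)]))
        hits += abs(o - t) <= 1
    return hits / trials
```

The key structural point mirrored from the abstract is that the opponent is trained too (here via the negated reward), so the tracker's curriculum hardens as the adversary improves, instead of being fixed by a static dataset.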
Problem

Research questions and friction points this paper is trying to address.

Embodied Visual Tracking
imitation learning
expert data
generalization
static environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

competitive multi-agent reinforcement learning
game-theoretic tracking
embodied visual tracking
vision-language-action models
adversarial benchmark
Youzhi Liu
Amap, Alibaba Group
Li Gao
Amap, Alibaba Group
Liu Liu
Amap, Alibaba Group
Mingyang Lv
Amap, Alibaba Group
Yang Cai
Professor of Computer Science and Economics, Yale University
Theoretical Computer Science · Algorithmic Game Theory · Mechanism Design · Learning