🤖 AI Summary
Existing embodied visual tracking methods rely on single-agent imitation learning, which generalizes poorly because it depends on costly expert demonstrations and static training environments. This work proposes CoMaTrack, a competitive game-theoretic multi-agent reinforcement learning framework in which agents train in dynamic adversarial settings with competitive subtasks, yielding stronger adaptive planning and interference-resilient strategies. It also introduces CoMaTrack-Bench, the first benchmark for competitive embodied visual tracking, which pits a tracker against adaptive opponents across diverse environments and instructions. On EVT-Bench, a 3B-parameter vision-language model trained with the framework outperforms prior single-agent imitation learning methods built on 7B models, achieving 92.1% on STT, 74.2% on DT, and 57.5% on AT; it also achieves state-of-the-art results on CoMaTrack-Bench.
📝 Abstract
Embodied Visual Tracking (EVT), a core dynamic task in embodied intelligence, requires an agent to precisely follow a language-specified target. Yet most existing methods rely on single-agent imitation learning, suffering from costly expert data and limited generalization due to static training environments. Inspired by competition-driven capability evolution, we propose CoMaTrack, a competitive game-theoretic multi-agent reinforcement learning framework that trains agents in a dynamic adversarial setting with competitive subtasks, yielding stronger adaptive planning and interference-resilient strategies. We further introduce CoMaTrack-Bench, the first benchmark for competitive EVT, featuring game scenarios between a tracker and adaptive opponents across diverse environments and instructions, enabling standardized robustness evaluation under active adversarial interactions. Experiments show that CoMaTrack achieves state-of-the-art results on both standard benchmarks and CoMaTrack-Bench. Notably, a 3B VLM trained with our framework surpasses previous single-agent imitation learning methods based on 7B models on the challenging EVT-Bench, achieving 92.1% in STT, 74.2% in DT, and 57.5% in AT. The benchmark code will be available at https://github.com/wlqcode/CoMaTrack-Bench