🤖 AI Summary
This work addresses embodied visual tracking under dynamic, heavily occluded conditions. We propose TrackVLA, a unified Vision-Language-Action (VLA) model in which a shared large language model (LLM) backbone drives both a language-modeling head for target recognition and an anchor-based diffusion model for robust trajectory planning. To train it, we introduce EVT-Bench, a large-scale embodied visual tracking benchmark whose 1.7 million samples span diverse recognition difficulty levels and support both training and generalization evaluation. TrackVLA achieves state-of-the-art performance on public benchmarks in a zero-shot manner, runs at a 10 FPS inference speed, and remains robust to high dynamics and severe occlusion in real-world environments.
📝 Abstract
Embodied visual tracking is a fundamental skill in Embodied AI, enabling an agent to follow a specific target in dynamic environments using only egocentric vision. The task is inherently challenging: it requires both accurate target recognition and effective trajectory planning under severe occlusion and high scene dynamics. Existing approaches typically address this challenge through a modular separation of recognition and planning. In this work, we propose TrackVLA, a Vision-Language-Action (VLA) model that learns the synergy between object recognition and trajectory planning. Leveraging a shared LLM backbone, we employ a language-modeling head for recognition and an anchor-based diffusion model for trajectory planning. To train TrackVLA, we construct an Embodied Visual Tracking Benchmark (EVT-Bench) and collect recognition samples of diverse difficulty levels, resulting in a dataset of 1.7 million samples. Through extensive experiments in both synthetic and real-world environments, TrackVLA demonstrates state-of-the-art (SOTA) performance and strong generalizability: it significantly outperforms existing methods on public benchmarks in a zero-shot manner while remaining robust to high dynamics and occlusion in real-world scenarios at a 10 FPS inference speed. Our project page is: https://pku-epic.github.io/TrackVLA-web.
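To make the dual-head design concrete, the following is a minimal PyTorch sketch of how a shared backbone can feed both a language-modeling head and an anchor-based diffusion head. Every name, layer size, and the pooling/denoiser structure here is an illustrative assumption for exposition, not TrackVLA's actual implementation.

```python
# Minimal sketch of a dual-head VLA layout like the one the abstract describes.
# All module names, dimensions, and the anchor/diffusion details are
# illustrative assumptions, not TrackVLA's released code.
import torch
import torch.nn as nn

class DualHeadVLA(nn.Module):
    def __init__(self, hidden=512, vocab=32000, n_anchors=8, horizon=16, n_layers=4):
        super().__init__()
        # Shared backbone: a small Transformer encoder stands in for the
        # multimodal LLM fusing egocentric vision tokens and instruction tokens.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Head 1: language-modeling head for target recognition (token logits).
        self.lm_head = nn.Linear(hidden, vocab)
        # Head 2: anchor-based trajectory head. A fixed set of learnable anchor
        # trajectories provides the initial modes; a denoising network refines
        # a noised trajectory conditioned on backbone features and timestep.
        self.anchors = nn.Parameter(torch.randn(n_anchors, horizon, 2))  # (x, y) waypoints
        self.anchor_scorer = nn.Linear(hidden, n_anchors)
        self.time_embed = nn.Embedding(1000, hidden)
        self.denoiser = nn.Sequential(
            nn.Linear(hidden + horizon * 2, hidden), nn.GELU(),
            nn.Linear(hidden, horizon * 2),
        )

    def forward(self, tokens, noisy_traj, t):
        # tokens:     (B, L, hidden) fused vision+language embeddings
        # noisy_traj: (B, horizon, 2) a noised trajectory at one diffusion step
        # t:          (B,) diffusion timestep indices
        h = self.backbone(tokens)                # shared representation
        ctx = h.mean(dim=1)                      # pooled conditioning vector
        logits = self.lm_head(h)                 # recognition via language modeling
        anchor_logits = self.anchor_scorer(ctx)  # score the anchor modes
        cond = ctx + self.time_embed(t)
        eps = self.denoiser(torch.cat([cond, noisy_traj.flatten(1)], dim=-1))
        return logits, anchor_logits, eps.view_as(noisy_traj)  # predicted noise

    @torch.no_grad()
    def init_traj(self, tokens):
        # At inference, start denoising from the highest-scoring anchor mode.
        h = self.backbone(tokens)
        idx = self.anchor_scorer(h.mean(dim=1)).argmax(dim=-1)  # (B,)
        return self.anchors[idx]                                # (B, horizon, 2)
```

In this reading, the two heads share all backbone computation, so recognition and planning are optimized jointly rather than as separate modules, which is the synergy the abstract emphasizes.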