🤖 AI Summary
To address embodied visual tracking failures caused by severe occlusion and visually similar distractors, this paper proposes a novel framework integrating spatial reasoning and long-term memory. Methodologically, we introduce Polar-CoT, a chain-of-thought (CoT) reasoning mechanism operating in polar coordinates to explicitly encode azimuth-distance priors for target localization. We further design a gated-update Target Identification Memory (TIM) module to maintain spatiotemporally consistent dynamic memory. The framework jointly models vision, language, and action, unifying CoT reasoning, polar-coordinate encoding, and gated memory updates. Evaluated on benchmarks including EVT-Bench DT, our approach outperforms state-of-the-art methods by 5.1–12.0 percentage points. It demonstrates significantly improved zero-shot generalization and robust tracking stability in realistic, dynamic environments.
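The summary describes Polar-CoT as encoding the target's inferred position as a compact polar-coordinate (azimuth-distance) token. The paper's actual tokenization is not given here, but the idea can be sketched as a simple discretization; the bin counts and maximum range below are illustrative assumptions, not values from the paper.

```python
import math

def to_polar_token(dx: float, dy: float,
                   n_dist_bins: int = 8, n_az_bins: int = 12,
                   max_dist: float = 10.0) -> int:
    """Discretize the target's relative offset (dx, dy) in the agent's
    frame into a single polar-coordinate token (distance bin x azimuth bin).

    Illustrative sketch only: bin counts and max_dist are assumptions,
    not the paper's actual vocabulary design.
    """
    dist = math.hypot(dx, dy)
    azimuth = math.atan2(dy, dx)  # [-pi, pi]; 0 = straight ahead
    d_bin = min(int(dist / max_dist * n_dist_bins), n_dist_bins - 1)
    a_bin = int((azimuth + math.pi) / (2 * math.pi) * n_az_bins) % n_az_bins
    return d_bin * n_az_bins + a_bin
```

A token like this gives the action head an explicit azimuth-distance prior in a single discrete symbol, which is the compactness the summary highlights.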
📝 Abstract
Embodied Visual Tracking (EVT) is a fundamental capability underpinning practical applications such as companion robots, guidance robots, and service assistants, where continuously following a moving target is essential. Recent advances have enabled language-guided tracking in complex and unstructured scenes. However, existing approaches lack explicit spatial reasoning and effective temporal memory, causing failures under severe occlusions or in the presence of similar-looking distractors. To address these challenges, we present TrackVLA++, a novel Vision-Language-Action (VLA) model that enhances embodied visual tracking with two key modules: a spatial reasoning mechanism and a Target Identification Memory (TIM). The reasoning module introduces a Chain-of-Thought paradigm, termed Polar-CoT, which infers the target's relative position and encodes it as a compact polar-coordinate token for action prediction. Guided by these spatial priors, the TIM employs a gated update strategy to preserve long-horizon target memory, ensuring spatiotemporal consistency and mitigating target loss during extended occlusions. Extensive experiments show that TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings. On the challenging EVT-Bench DT split, TrackVLA++ surpasses the previous leading approach by 5.1 and 12.0 points in the egocentric and multi-camera settings, respectively. Furthermore, TrackVLA++ exhibits strong zero-shot generalization, enabling robust real-world tracking in dynamic and occluded scenarios.
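The abstract's gated update strategy for TIM can be illustrated with a minimal sketch: a learned gate, scaled by target-visibility confidence, decides how much of the new observation overwrites the stored target memory, so the memory is frozen during occlusion and refreshed when the target is clearly visible. The gate parameterization and the confidence scaling below are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def gated_memory_update(memory: np.ndarray, observation: np.ndarray,
                        W_g: np.ndarray, confidence: float) -> np.ndarray:
    """Gated update of a target-identification memory vector.

    Illustrative sketch: `W_g` (gate weights) and the multiplicative
    `confidence` term are hypothetical, not the paper's exact design.
    Low confidence (e.g. occlusion) drives the gate toward 0, preserving
    the stored memory; high confidence lets the observation flow in.
    """
    gate = confidence * sigmoid(W_g @ np.concatenate([memory, observation]))
    return (1.0 - gate) * memory + gate * observation
```

Because the update is a convex combination, the memory can never be overwritten by a distractor faster than the gate allows, which is one plausible way spatiotemporal consistency is maintained over long horizons.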