Hierarchical Instruction-aware Embodied Visual Tracking

📅 2025-05-27

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

User-Centric Embodied Visual Tracking (UC-EVT) suffers from a substantial semantic gap between high-level natural language instructions and low-level agent actions. Method: The paper proposes a hierarchical framework, HIEVT, that bridges this gap by using spatial goals as an intermediate representation: a large language model (LLM) translates instructions into spatial goals that specify where the target should appear, while an offline reinforcement learning (RL) policy learns to position the target accordingly. The goal-aligned policy is trained offline, which avoids the inference-speed limits of LLMs and VLMs and the generalization limits of VLAs. Results: The method is trained on over ten million trajectories and evaluated in one seen and nine unseen environments. Experiments and real-world deployments indicate robustness across diverse environments, varying target dynamics, and complex instruction combinations.


📝 Abstract

User-Centric Embodied Visual Tracking (UC-EVT) presents a novel challenge for reinforcement learning-based models due to the substantial gap between high-level user instructions and low-level agent actions. While recent advancements in language models (e.g., LLMs, VLMs, VLAs) have improved instruction comprehension, these models face critical limitations in either inference speed (LLMs, VLMs) or generalizability (VLAs) for UC-EVT tasks. To address these challenges, we propose the Hierarchical Instruction-aware Embodied Visual Tracking (HIEVT) agent, which bridges instruction comprehension and action generation using spatial goals as intermediaries. HIEVT first introduces an LLM-based Semantic-Spatial Goal Aligner to translate diverse human instructions into spatial goals that directly annotate the desired spatial position. Then the RL-based Adaptive Goal-Aligned Policy, a general offline policy, enables the tracker to position the target as specified by the spatial goal. To benchmark UC-EVT tasks, we collect over ten million trajectories for training and evaluate across one seen environment and nine unseen challenging environments. Extensive experiments and real-world deployments demonstrate the robustness and generalizability of HIEVT across diverse environments, varying target dynamics, and complex instruction combinations. The complete project is available at https://sites.google.com/view/hievt.

Problem

Research questions and friction points this paper is trying to address.

Bridging high-level user instructions and low-level agent actions

Overcoming limitations in inference speed and generalizability

Translating human instructions into spatial goals for tracking

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based Semantic-Spatial Goal Aligner

RL-based Adaptive Goal-Aligned Policy

Hierarchical Instruction-aware Embodied Visual Tracking
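
The two-level idea listed above (an instruction is first turned into a spatial goal, and a goal-conditioned policy then acts on it) can be pictured with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the SpatialGoal fields, the keyword rules that stand in for the LLM-based aligner, and the proportional controller that stands in for the learned offline RL policy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpatialGoal:
    """Hypothetical spatial goal: where the target should appear in the image.

    All values are normalized to [0, 1]; the real representation may differ.
    """
    cx: float    # desired horizontal center of the target
    cy: float    # desired vertical center of the target
    size: float  # desired apparent size, used here as a proxy for distance


DEFAULT_GOAL = SpatialGoal(cx=0.5, cy=0.5, size=0.2)


def align_instruction(instruction: str, goal: SpatialGoal = DEFAULT_GOAL) -> SpatialGoal:
    """Stand-in for the LLM-based aligner: simple keyword rules, not an LLM."""
    text = instruction.lower()
    cx, size = goal.cx, goal.size
    if "left" in text:
        cx = 0.25
    elif "right" in text:
        cx = 0.75
    if "closer" in text:
        size = min(1.0, size * 1.5)
    elif "farther" in text:
        size = max(0.05, size / 1.5)
    return SpatialGoal(cx=cx, cy=goal.cy, size=size)


def goal_conditioned_action(observed: SpatialGoal, goal: SpatialGoal, gain: float = 1.0):
    """Stand-in for the learned policy: a proportional controller.

    Returns (turn, forward). Positive turn rotates the tracker to the right,
    which shifts the target left in the image; positive forward moves the
    tracker toward the target, which makes it appear larger.
    """
    turn = gain * (observed.cx - goal.cx)
    forward = gain * (goal.size - observed.size)
    return turn, forward


if __name__ == "__main__":
    goal = align_instruction("Keep the person on the left and get closer")
    # The target is currently centered at the default size.
    print(goal, goal_conditioned_action(DEFAULT_GOAL, goal))
```

The point of the split is that the slow, language-heavy step runs only when the instruction changes, while the lightweight goal-conditioned step runs at every frame.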
