Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation

📅 2026-05-19

🤖 AI Summary

In cluttered environments, language instructions alone often fail to specify precisely which object to manipulate and where to place it. This work proposes Spatially Prompted Visual Trajectory Prediction (SP-VTP), a task setting in which spatial prompts (e.g., bounding boxes or points) on the first video frame define the task goal and the model predicts the future end-effector trajectory from egocentric visual observations. The authors formalize the task (to the best of their knowledge, for the first time), introduce the EgoSPT dataset, and present the SPOT model. SPOT aligns the static spatial prompts with the evolving scene through a task encoder that fuses first-frame visual and spatial cues and an observation encoder that processes the current view and history context. Under strict scene-level splits, SPOT improves cross-scene trajectory prediction over baselines that use no prompts or single-source prompts, which supports spatial prompting as a scalable task condition for egocentric manipulation.
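The evaluation protocol mentioned above keeps scenes separate across splits. A minimal sketch of what such a split could look like follows; the `scene_id` field, the `test_fraction` default, and the function name are illustrative assumptions, not the actual EgoSPT format or the authors' split procedure.

```python
import random
from collections import defaultdict


def scene_level_split(samples, test_fraction=0.2, seed=0):
    """Split samples so that no scene contributes to both train and test.

    `samples` is a list of dicts with a hypothetical "scene_id" key.
    """
    # Group samples by the scene they were recorded in.
    by_scene = defaultdict(list)
    for sample in samples:
        by_scene[sample["scene_id"]].append(sample)

    # Shuffle scene ids (not individual samples) with a fixed seed.
    scenes = sorted(by_scene)
    random.Random(seed).shuffle(scenes)

    # Hold out whole scenes for testing; always hold out at least one.
    n_test = max(1, round(len(scenes) * test_fraction))
    test_scenes = set(scenes[:n_test])

    train = [s for sc in scenes if sc not in test_scenes for s in by_scene[sc]]
    test = [s for sc in scenes if sc in test_scenes for s in by_scene[sc]]
    return train, test


# Toy usage: 5 scenes with 3 trajectories each.
toy = [{"scene_id": f"scene_{i}", "traj_id": j} for i in range(5) for j in range(3)]
train_set, test_set = scene_level_split(toy)
```

Splitting by scene rather than by trajectory means the test set measures generalization to unseen environments, which is the cross-scene setting the summary refers to.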

📝 Abstract

Robotic manipulation is often specified through language instructions or task identifiers, yet cluttered environments with similar objects are better handled by spatially indicating what to move and where to place it. Addressing the vision-centric challenge of object and goal specification, we present, to the best of our knowledge, the first formalization of Spatially Prompted Visual Trajectory Prediction (SP-VTP). This setting uses initial spatial prompts (such as bounding boxes or points) to define the task objective, and the model must forecast future end-effector trajectories from egocentric video streams. To study this problem, we collect and annotate EgoSPT, a dataset of egocentric spatially prompted manipulation trajectories with first-frame object and target grounding annotations and recovered 3D end-effector motion. SP-VTP is challenging because the task specification is static while the scene configuration evolves over time. To solve this problem, we propose SPOT (Spatially Prompted Object-Target Policy), which combines a task encoder for first-frame visual and coordinate spatial prompts, an observation encoder for current visual and history context, and a trajectory generator for future end-effector motion. Experiments under strict scene-level splits show that SPOT improves cross-scene trajectory prediction over non-prompted and single-source prompted baselines. Together, EgoSPT and SPOT establish SP-VTP as a new spatial prompting problem and as a simple, scalable task condition for egocentric manipulation.
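To make the idea of a coordinate spatial prompt concrete, the sketch below turns a first-frame object box and target point into a resolution-independent vector. The `SpatialPrompt` class, its field names, the choice of a box for the object and a point for the target, and the normalization to [0, 1] are all assumptions made for illustration; the abstract does not specify how SPOT encodes its prompts.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpatialPrompt:
    """Hypothetical first-frame prompt: what to move and where to place it."""
    object_box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), pixels
    target_point: Tuple[float, float]              # (x, y), pixels


def encode_prompt(prompt: SpatialPrompt, width: int, height: int) -> List[float]:
    """Normalize pixel coordinates by image size so the vector lies in [0, 1]."""
    x0, y0, x1, y1 = prompt.object_box
    tx, ty = prompt.target_point
    return [
        x0 / width, y0 / height,   # top-left corner of the object box
        x1 / width, y1 / height,   # bottom-right corner of the object box
        tx / width, ty / height,   # placement target
    ]


# Toy usage on a 640x480 first frame.
prompt = SpatialPrompt(object_box=(64, 48, 128, 96), target_point=(320, 240))
prompt_vector = encode_prompt(prompt, width=640, height=480)
```

A vector like this stays fixed for the whole episode, which illustrates the difficulty the abstract points out: the prompt describes the first frame, while the observations the model receives later show a scene that has already changed.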

Problem

Research questions and friction points this paper is trying to address.

Spatial Prompting

Trajectory Prediction

Egocentric Manipulation

Visual Grounding

Robotics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatially Prompted Visual Trajectory Prediction

Egocentric Manipulation

SPOT