AI Summary
This work addresses the challenge of generating physically plausible, semantically consistent 6-degree-of-freedom (6DoF) manipulation trajectories for interactive robots that operate under egocentric (first-person) vision and follow natural language instructions. We introduce the first egocentric 6DoF trajectory generation task and its benchmark dataset, HOT3D. Methodologically, we propose an automated trajectory extraction framework built on the multi-view videos of Exo-Ego4D, integrating a vision and point-cloud multimodal large language model, self-supervised trajectory alignment, cross-view action representation learning, and a text- and trajectory-conditioned generative architecture. Experiments show that our method produces trajectories on HOT3D with superior physical plausibility and semantic fidelity, significantly outperforming existing baselines. By removing the bottleneck of labor-intensive, manual collection of fine-grained manipulation demonstrations, this work establishes a scalable training paradigm and provides an open-source baseline model for embodied intelligence.
Abstract
Learning to use tools or objects in common scenes, particularly handling them in various ways as instructed, is a key challenge for developing interactive robots. Training models to generate such manipulation trajectories requires a large and diverse collection of detailed manipulation demonstrations for many objects, which is nearly infeasible to gather at scale. In this paper, we propose a framework that leverages the large-scale ego- and exo-centric video dataset Exo-Ego4D -- constructed globally with substantial effort -- to extract diverse manipulation trajectories at scale. From these extracted trajectories and their associated textual action descriptions, we develop trajectory generation models based on visual and point cloud-based language models. On the recently proposed high-quality egocentric trajectory dataset HOT3D, we confirm that our models successfully generate valid object trajectories, establishing a training dataset and baseline models for the novel task of generating 6DoF manipulation trajectories from action descriptions in egocentric vision.
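The central data type in this task, a 6DoF manipulation trajectory, is a time-ordered sequence of object poses, each combining a 3D position with a 3D orientation. The minimal sketch below illustrates one common convention (position vector plus unit quaternion); the `Pose6DoF` class and `make_trajectory` helper are our own illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Pose6DoF:
    """One 6DoF object pose: 3D position plus orientation quaternion (w, x, y, z)."""
    position: np.ndarray    # shape (3,), meters
    quaternion: np.ndarray  # shape (4,), unit norm


def make_trajectory(n_steps: int) -> list[Pose6DoF]:
    """Build a toy straight-line trajectory with a fixed (identity) orientation."""
    traj = []
    for t in range(n_steps):
        pos = np.array([0.1 * t, 0.0, 0.0])    # translate along x, 10 cm per step
        quat = np.array([1.0, 0.0, 0.0, 0.0])  # identity rotation
        traj.append(Pose6DoF(pos, quat))
    return traj


traj = make_trajectory(5)
```

A generative model for this task would output such a sequence conditioned on an egocentric observation and a textual action description; the benchmark then scores the predicted poses for physical plausibility and semantic fidelity.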