🤖 AI Summary
Current 3D hand trajectory prediction methods face two key bottlenecks: (1) existing datasets lack jointly annotated motion and semantics, and (2) model reasoning is only weakly coupled to physical motion dynamics. To address these, we propose a semantic-aware 3D hand trajectory prediction framework for first-person interaction videos. Our contributions are threefold: (1) we introduce EgoMAN, the first dataset with explicit stage-level annotations of interactive actions; (2) we design a “reasoning-to-motion” trajectory-token interface in which structured question answering drives joint modeling of semantics, spatial geometry, and motion dynamics; (3) we combine multimodal fusion, stage-wise alignment training, and a 6DoF trajectory parameterization. Experiments show state-of-the-art accuracy in real-world scenarios, stage-aware trajectory forecasting, and markedly improved cross-scene generalization.
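To make the data side concrete, here is a minimal sketch of what a stage-annotated 6DoF trajectory sample with structured QA pairs could look like. The field names (`poses`, `stage`, `qa_pairs`) and example values are illustrative assumptions, not the released EgoMAN schema.

```python
# Sketch of one trajectory sample: per-frame 6DoF hand poses, an interaction
# stage label, and structured (question, answer) pairs for semantic, spatial,
# and motion reasoning. Field names and values are hypothetical.
from dataclasses import dataclass
import numpy as np

@dataclass
class HandTrajectorySample:
    poses: np.ndarray                     # (T, 6): xyz translation + axis-angle rotation per frame
    stage: str                            # interaction stage label, e.g. "reach", "grasp", "manipulate"
    qa_pairs: list[tuple[str, str]]       # structured QA annotations tied to this clip

sample = HandTrajectorySample(
    poses=np.zeros((30, 6), dtype=np.float32),   # 30 future frames, placeholder values
    stage="reach",
    qa_pairs=[("Which object will the hand interact with next?", "the mug on the counter")],
)
print(sample.poses.shape, sample.stage)
```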
📝 Abstract
Prior work on 3D hand trajectory prediction is constrained by datasets that decouple motion from semantic supervision and by models that only weakly link reasoning to action. To address these limitations, we first present EgoMAN, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that couples vision-language reasoning with motion generation through a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate, stage-aware trajectories that generalize across real-world scenes.
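As a rough illustration of the trajectory-token idea, the sketch below regresses future 6DoF poses from the hidden state a reasoning backbone produces at a dedicated trajectory-token position. Module names, dimensions, and the single-token design are assumptions for illustration, not the EgoMAN implementation.

```python
# Sketch of a "reasoning-to-motion" interface: the hidden state at a special
# trajectory-token position, emitted by a vision-language backbone after QA,
# conditions a small head that decodes a sequence of future 6DoF hand poses.
# All names and sizes are hypothetical.
import torch
import torch.nn as nn

class TrajectoryTokenDecoder(nn.Module):
    def __init__(self, hidden_dim: int = 1024, horizon: int = 30, pose_dim: int = 6):
        super().__init__()
        self.horizon, self.pose_dim = horizon, pose_dim
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, horizon * pose_dim),
        )

    def forward(self, traj_token_state: torch.Tensor) -> torch.Tensor:
        # traj_token_state: (B, hidden_dim) backbone hidden state at the trajectory token.
        out = self.head(traj_token_state)
        return out.view(-1, self.horizon, self.pose_dim)   # (B, T, 6) future poses

decoder = TrajectoryTokenDecoder()
fake_token_state = torch.randn(2, 1024)    # placeholder backbone outputs
future_poses = decoder(fake_token_state)
print(future_poses.shape)                  # torch.Size([2, 30, 6])
```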