Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current 3D hand trajectory prediction methods face two key bottlenecks: (1) existing datasets lack joint motion and semantic annotations, and (2) model reasoning is only weakly coupled to physical motion dynamics. To address these, we propose a semantic-aware 3D hand trajectory prediction framework for egocentric interaction videos. Our contributions are threefold: (1) we introduce EgoMAN, the first dataset with explicit stage-level annotations of interactive actions; (2) we design a “reasoning-to-motion” trajectory-token interface that uses structured question answering to jointly model semantics, spatial geometry, and motion dynamics; (3) we incorporate multimodal fusion, stage-wise alignment training, and a 6DoF trajectory parameterization. Experiments demonstrate state-of-the-art trajectory accuracy, stage-aware forecasting, and significantly improved generalization across real-world scenes.

📝 Abstract
Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.
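The abstract mentions 219K 6DoF trajectories. A 6DoF hand pose combines 3 translational and 3 rotational degrees of freedom; the sketch below shows one common way to pack such poses into a trajectory array. The paper's exact parameterization is not specified here, so the axis-angle choice and the function names are illustrative assumptions.

```python
import numpy as np

def make_waypoint(translation, axis_angle):
    """Pack a single 6DoF pose: 3D position plus 3D axis-angle rotation.

    Axis-angle is an assumed rotation parameterization, not necessarily
    the one used by EgoMAN.
    """
    t = np.asarray(translation, dtype=np.float64)
    r = np.asarray(axis_angle, dtype=np.float64)
    assert t.shape == (3,) and r.shape == (3,)
    return np.concatenate([t, r])  # shape (6,)

def make_trajectory(waypoints):
    """Stack T (translation, rotation) pairs into a (T, 6) trajectory."""
    return np.stack([make_waypoint(t, r) for t, r in waypoints])

# Two hypothetical waypoints of a reaching motion (meters, radians).
traj = make_trajectory([
    ([0.10, 0.05, 0.40], [0.0, 0.0, 0.0]),
    ([0.12, 0.07, 0.38], [0.0, 0.1, 0.0]),
])
print(traj.shape)  # (2, 6)
```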
Problem

Research questions and friction points this paper is trying to address.

How to predict 3D hand trajectories from egocentric videos
How to link semantic reasoning with motion generation
How to generalize across real-world interaction scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale egocentric dataset with structured QA pairs
Reasoning-to-motion framework linking vision-language and motion
Progressive training aligning reasoning with motion dynamics
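A trajectory-token interface can be pictured as a special token whose hidden state the backbone emits and a small motion head decodes into future 6DoF poses. The sketch below is a minimal stand-in under assumed dimensions; the variable names, the random hidden state, and the linear head are all illustrative, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 32   # hidden size of the (hypothetical) vision-language backbone
HORIZON = 8   # number of future waypoints to predict
POSE_DIM = 6  # 3 translation + 3 rotation parameters

# Stand-in for the backbone's hidden state at the trajectory-token position.
traj_token_hidden = rng.standard_normal(HIDDEN)

# A linear motion head mapping the token state to a flat pose sequence.
W = rng.standard_normal((HORIZON * POSE_DIM, HIDDEN)) * 0.01
b = np.zeros(HORIZON * POSE_DIM)

trajectory = (W @ traj_token_hidden + b).reshape(HORIZON, POSE_DIM)
print(trajectory.shape)  # (8, 6)
```

The point of such an interface is that reasoning (token prediction) and motion generation (continuous decoding) share one representation, which is what lets the training stages be aligned progressively.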