🤖 AI Summary
Current 3D hand trajectory prediction methods face two key bottlenecks: (1) existing datasets lack jointly annotated motion and semantics, and (2) model reasoning is only weakly coupled to physical motion dynamics. To address these, we propose a semantic-aware 3D hand trajectory prediction framework for first-person interaction videos. Our contributions are threefold: (1) we introduce EgoMAN, the first dataset with explicit stage-level annotations of interactive actions; (2) we design a “reasoning-to-motion” trajectory-token interface in which structured question answering drives joint modeling of semantics, spatial geometry, and motion dynamics; (3) we combine multimodal fusion, stage-wise alignment training, and a 6DoF trajectory parameterization. Experiments show state-of-the-art accuracy in real-world scenarios, stage-aware trajectory forecasting, and markedly improved cross-scene generalization.
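To make the data side concrete, here is a minimal sketch of what a stage-annotated 6DoF trajectory sample with structured QA pairs could look like. The field names (`poses`, `stage`, `qa_pairs`) and example values are illustrative assumptions, not the released EgoMAN schema.

```python
# Sketch of one trajectory sample: per-frame 6DoF hand poses, an interaction
# stage label, and structured (question, answer) pairs for semantic, spatial,
# and motion reasoning. Field names and values are hypothetical.
from dataclasses import dataclass
import numpy as np

@dataclass
class HandTrajectorySample:
    poses: np.ndarray                     # (T, 6): xyz translation + axis-angle rotation per frame
    stage: str                            # interaction stage label, e.g. "reach", "grasp", "manipulate"
    qa_pairs: list[tuple[str, str]]       # structured QA annotations tied to this clip

sample = HandTrajectorySample(
    poses=np.zeros((30, 6), dtype=np.float32),   # 30 future frames, placeholder values
    stage="reach",
    qa_pairs=[("Which object will the hand interact with next?", "the mug on the counter")],
)
print(sample.poses.shape, sample.stage)
```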
📝 Abstract
Prior work on 3D hand trajectory prediction is constrained by datasets that decouple motion from semantic supervision and by models that only weakly link reasoning to action. To address these limitations, we first present EgoMAN, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that couples vision-language reasoning with motion generation through a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate, stage-aware trajectories that generalize across real-world scenes.
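As a rough illustration of the trajectory-token idea, the sketch below regresses future 6DoF poses from the hidden state a reasoning backbone produces at a dedicated trajectory-token position. Module names, dimensions, and the single-token design are assumptions for illustration, not the EgoMAN implementation.

```python
# Sketch of a "reasoning-to-motion" interface: the hidden state at a special
# trajectory-token position, emitted by a vision-language backbone after QA,
# conditions a small head that decodes a sequence of future 6DoF hand poses.
# All names and sizes are hypothetical.
import torch
import torch.nn as nn

class TrajectoryTokenDecoder(nn.Module):
    def __init__(self, hidden_dim: int = 1024, horizon: int = 30, pose_dim: int = 6):
        super().__init__()
        self.horizon, self.pose_dim = horizon, pose_dim
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, horizon * pose_dim),
        )

    def forward(self, traj_token_state: torch.Tensor) -> torch.Tensor:
        # traj_token_state: (B, hidden_dim) backbone hidden state at the trajectory token.
        out = self.head(traj_token_state)
        return out.view(-1, self.horizon, self.pose_dim)   # (B, T, 6) future poses

decoder = TrajectoryTokenDecoder()
fake_token_state = torch.randn(2, 1024)    # placeholder backbone outputs
future_poses = decoder(fake_token_state)
print(future_poses.shape)                  # torch.Size([2, 30, 6])
```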