MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos

πŸ“… 2024-09-04
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 11
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing methods for hand trajectory prediction in first-person videos struggle to model high-order intent and temporal causality under camera egomotion interference and are hindered by the absence of affordance annotations. Method: We propose a motion-aware Mamba architecture featuring the novel Motion-Driven Selective Scanning (MDSS) mechanism, which explicitly incorporates wearer egomotion. Without requiring affordance labels, our approach implicitly learns hand–scene semantic relationships via vision-language foundation models (e.g., CLIP) and employs a diffusion model with latent-space denoising to generate physically plausible trajectories. Contribution/Results: Our method achieves state-of-the-art performance across five public benchmarks, supports real-time inference, and introduces new evaluation metrics to rigorously assess trajectory reasonableness and physical feasibility.

Technology Category

Application Category

πŸ“ Abstract
Understanding human intentions and actions through egocentric videos is important on the path to embodied artificial intelligence. As a branch of egocentric vision techniques, hand trajectory prediction plays a vital role in comprehending human motion patterns, benefiting downstream tasks in extended reality and robot manipulation. However, capturing high-level human intentions consistent with reasonable temporal causality is challenging when only egocentric videos are available. This difficulty is exacerbated under camera egomotion interference and the absence of affordance labels to explicitly guide the optimization of hand waypoint distribution. In this work, we propose a novel hand trajectory prediction method dubbed MADiff, which forecasts future hand waypoints with diffusion models. The devised denoising operation in the latent space is achieved by our proposed motion-aware Mamba, where the camera wearer's egomotion is integrated to achieve motion-driven selective scan (MDSS). To discern the relationship between hands and scenarios without explicit affordance supervision, we leverage a foundation model that fuses visual and language features to capture high-level semantics from video clips. Comprehensive experiments conducted on five public datasets with the existing and our proposed new evaluation metrics demonstrate that MADiff predicts comparably reasonable hand trajectories compared to the state-of-the-art baselines, and achieves real-time performance. We will release our code and pretrained models of MADiff at the project page: https://irmvlab.github.io/madiff.github.io.
Problem

Research questions and friction points this paper is trying to address.

Predicting hand trajectories from egocentric videos with camera motion interference
Capturing high-level human intentions without explicit affordance supervision
Modeling temporal causality in hand movements for embodied AI applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Motion-aware Mamba for selective scan denoising
Diffusion models forecast future hand waypoints
Visual-language foundation model captures high-level semantics
πŸ”Ž Similar Papers
No similar papers found.
J
Junyi Ma
IRMV Lab, School of Automation and Intelligent Sensing, Shanghai Jiao Tong University and State Key Laboratory of Avionics Integration and Aviation System-of-Systems Synthesis, Shanghai Key Laboratory of Navigation and Location Based Services, Shanghai 200240, China
Xieyuanli Chen
Xieyuanli Chen
Associate Professor, NUDT, China
RoboticsSLAMLocalizationLiDAR PerceptionRobot Learning
Wentao Bao
Wentao Bao
Research Scientist at Meta
Computer VisionMachine Learning
J
Jingyi Xu
Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
H
Hesheng Wang
IRMV Lab, School of Automation and Intelligent Sensing, Shanghai Jiao Tong University and State Key Laboratory of Avionics Integration and Aviation System-of-Systems Synthesis, Shanghai Key Laboratory of Navigation and Location Based Services, Shanghai 200240, China