🤖 AI Summary
This work addresses the problem of generating dance motions from first-person-view (FPV) videos jointly conditioned on accompanying music, tackling two key challenges: severe occlusion-induced inaccuracies in full-body pose estimation from FPV data, and cross-modal temporal alignment between visual and auditory signals. We propose Skeleton Mamba, the first architecture to explicitly apply state space models (SSMs), specifically Mamba, to skeletal sequence modeling, integrated with a diffusion-based framework for multimodal temporal fusion. To our knowledge, this is the first end-to-end method achieving FPV- and music-driven dance generation, evaluated on the EgoAIST++ dataset. By combining self-attention with structured state updates, Skeleton Mamba effectively captures long-range spatiotemporal dependencies. Extensive experiments demonstrate significant improvements over state-of-the-art methods in both synthetic and real-world scenarios, with strong generalization capability and practical deployment potential.
📝 Abstract
Estimating human dance motion is a challenging task with various industrial applications. Recently, many efforts have focused on predicting human dance motion from either egocentric video or music alone. However, jointly estimating human motion from both egocentric video and music remains largely unexplored. In this paper, we develop a new method that predicts human dance motion from both egocentric video and music. In practice, the egocentric view obscures much of the body, making accurate full-pose estimation challenging. Additionally, incorporating music requires the generated head and body movements to align well with both the visual and musical inputs. We first introduce EgoAIST++, a new large-scale dataset that combines egocentric views and music with more than 36 hours of dance motion. Drawing on the success of diffusion models and Mamba in sequence modeling, we develop an EgoMusic Motion Network whose core Skeleton Mamba explicitly captures the skeletal structure of the human body. We show that our approach is theoretically grounded. Extensive experiments show that our method clearly outperforms state-of-the-art approaches and generalizes effectively to real-world data.
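The abstract describes Skeleton Mamba as applying Mamba-style state-space updates to skeletal sequences. As a rough, self-contained illustration of the underlying mechanism only (not the paper's architecture), a diagonal selective-scan recurrence over per-frame joint features might look like the sketch below; the function name, parameter shapes, and the sigmoid gate are all assumptions made for this toy example:

```python
import numpy as np

def skeleton_selective_scan(x, log_a, w_b, c):
    """Toy diagonal selective-SSM scan over a skeleton feature sequence.

    x     : (T, D) per-frame skeleton features (e.g. flattened joint coords).
    log_a : (D,)   log of per-channel state decay, a = exp(log_a) in (0, 1).
    w_b   : (D,)   weights making the input gate depend on the frame itself
                   (the "selective" idea, heavily simplified from Mamba).
    c     : (D,)   output readout weights.
    Returns y : (T, D).
    """
    a = np.exp(log_a)                # per-channel decay factor
    h = np.zeros(x.shape[1])         # hidden state, one scalar per channel
    y = np.empty_like(x)
    for t, frame in enumerate(x):
        b_t = 1.0 / (1.0 + np.exp(-w_b * frame))  # input-dependent gate
        h = a * h + b_t * frame                   # recurrent state update
        y[t] = c * h                              # readout
    return y

# Toy usage: 60 frames, 24 joints x 3 coordinates = 72 channels
rng = np.random.default_rng(0)
T, D = 60, 72
x = rng.standard_normal((T, D))
y = skeleton_selective_scan(x, log_a=np.full(D, -0.1),
                            w_b=np.ones(D), c=np.ones(D))
print(y.shape)  # (60, 72)
```

The recurrence keeps a running per-channel state whose decay and input gating let the model retain long-range temporal context at linear cost in sequence length; real Mamba layers add learned projections, discretization, and a hardware-efficient parallel scan on top of this idea.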