🤖 AI Summary
This work addresses the problem of generating dance motions from first-person-view (FPV) videos jointly conditioned on accompanying music, tackling two key challenges: severe occlusion-induced inaccuracies in full-body pose estimation from FPV data, and cross-modal temporal alignment between visual and auditory signals. We propose Skeleton Mamba, the first architecture to explicitly apply state space models (SSMs), specifically Mamba, to skeletal sequence modeling, integrated with a diffusion-based framework for multimodal temporal fusion. To our knowledge, this is the first end-to-end method achieving FPV- and music-driven dance generation, evaluated on the EgoAIST++ dataset. By combining self-attention with structured state updates, Skeleton Mamba effectively captures long-range spatiotemporal dependencies. Extensive experiments demonstrate significant improvements over state-of-the-art methods in both synthetic and real-world scenarios, with strong generalization capability and practical deployment potential.
📝 Abstract
Estimating human dance motion is a challenging task with various industrial applications. Recently, many efforts have focused on predicting human dance motion from either egocentric video or music alone. However, jointly estimating human motion from both egocentric video and music remains largely unexplored. In this paper, we develop a new method that predicts human dance motion from both egocentric video and music. In practice, the egocentric view obscures much of the body, making accurate full-pose estimation challenging. Additionally, incorporating music requires the generated head and body movements to align well with both the visual and musical inputs. We first introduce EgoAIST++, a new large-scale dataset that combines egocentric views and music with more than 36 hours of dance motion. Drawing on the success of diffusion models and Mamba in sequence modeling, we develop an EgoMusic Motion Network whose core Skeleton Mamba explicitly captures the skeletal structure of the human body. We show that our approach is theoretically grounded. Extensive experiments show that our method clearly outperforms state-of-the-art approaches and generalizes effectively to real-world data.
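The abstract describes Skeleton Mamba as applying Mamba-style state-space updates to skeletal sequences. As a rough, self-contained illustration of the underlying mechanism only (not the paper's architecture), a diagonal selective-scan recurrence over per-frame joint features might look like the sketch below; the function name, parameter shapes, and the sigmoid gate are all assumptions made for this toy example:

```python
import numpy as np

def skeleton_selective_scan(x, log_a, w_b, c):
    """Toy diagonal selective-SSM scan over a skeleton feature sequence.

    x     : (T, D) per-frame skeleton features (e.g. flattened joint coords).
    log_a : (D,)   log of per-channel state decay, a = exp(log_a) in (0, 1).
    w_b   : (D,)   weights making the input gate depend on the frame itself
                   (the "selective" idea, heavily simplified from Mamba).
    c     : (D,)   output readout weights.
    Returns y : (T, D).
    """
    a = np.exp(log_a)                # per-channel decay factor
    h = np.zeros(x.shape[1])         # hidden state, one scalar per channel
    y = np.empty_like(x)
    for t, frame in enumerate(x):
        b_t = 1.0 / (1.0 + np.exp(-w_b * frame))  # input-dependent gate
        h = a * h + b_t * frame                   # recurrent state update
        y[t] = c * h                              # readout
    return y

# Toy usage: 60 frames, 24 joints x 3 coordinates = 72 channels
rng = np.random.default_rng(0)
T, D = 60, 72
x = rng.standard_normal((T, D))
y = skeleton_selective_scan(x, log_a=np.full(D, -0.1),
                            w_b=np.ones(D), c=np.ones(D))
print(y.shape)  # (60, 72)
```

The recurrence keeps a running per-channel state whose decay and input gating let the model retain long-range temporal context at linear cost in sequence length; real Mamba layers add learned projections, discretization, and a hardware-efficient parallel scan on top of this idea.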