🤖 AI Summary
This paper addresses two key challenges in human motion analysis: (1) the lack of explicit alignment among multimodal features, and (2) the loss of high-frequency motion details, such as joint velocities, under standard masked autoencoding frameworks. To this end, the authors propose a motion-video bimodal self-supervised learning framework. Methodologically, they (1) design a dual-path motion-video architecture with an explicit cross-modal feature alignment mechanism to enforce semantic consistency, and (2) introduce a velocity-guided high-frequency reconstruction loss to preserve dynamic motion details and mitigate temporal over-smoothing. The framework integrates multimodal encoders, a cross-modal alignment module, masked motion modeling, and velocity-aware reconstruction objectives. Extensive experiments on standard benchmarks demonstrate significant improvements over state-of-the-art methods, enhancing both pretraining efficiency for large models and the interpretability of learned motion concepts. The code will be publicly released.
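The paper does not give the exact form of the velocity-guided high-frequency reconstruction loss; a minimal sketch of the idea, assuming velocities are first-order temporal differences of joint positions and that the loss simply adds a velocity-matching term to a standard position MSE (function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def velocity_guided_loss(pred, target, w_vel=1.0):
    """Sketch of a velocity-guided reconstruction loss.

    pred, target: (T, J, 3) arrays of predicted / ground-truth joint
    positions over T frames. Velocities are first-order temporal
    differences; penalizing their mismatch preserves high-frequency
    motion detail that a plain position MSE tends to over-smooth.
    """
    pos_err = np.mean((pred - target) ** 2)   # standard reconstruction term
    v_pred = np.diff(pred, axis=0)            # per-frame joint velocities
    v_tgt = np.diff(target, axis=0)
    vel_err = np.mean((v_pred - v_tgt) ** 2)  # high-frequency (velocity) term
    return pos_err + w_vel * vel_err
```

Note that a prediction offset from the target by a constant incurs only the position term, while a temporally smoothed prediction is penalized by the velocity term even where its average position error is small.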
📝 Abstract
We present HuMoCon, a novel motion-video understanding framework designed for advanced human behavior analysis. The core of our method is a human motion concept discovery framework that efficiently trains multi-modal encoders to extract semantically meaningful and generalizable features. HuMoCon addresses key challenges in motion concept discovery for understanding and reasoning, including the lack of explicit multi-modality feature alignment and the loss of high-frequency information in masked autoencoding frameworks. Our approach integrates a feature alignment strategy that leverages video for contextual understanding and motion for fine-grained interaction modeling, together with a velocity reconstruction mechanism to enhance high-frequency feature expression and mitigate temporal over-smoothing. Comprehensive experiments on standard benchmarks demonstrate that HuMoCon enables effective motion concept discovery and significantly outperforms state-of-the-art methods in training large models for human motion understanding. We will open-source the associated code alongside our paper.
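The abstract leaves the alignment strategy unspecified; one common way to enforce explicit multi-modality feature alignment is a symmetric InfoNCE-style contrastive objective between paired motion and video embeddings. A minimal sketch under that assumption (the function name, temperature value, and NumPy formulation are illustrative, not the paper's implementation):

```python
import numpy as np

def infonce_alignment(motion_feats, video_feats, temperature=0.07):
    """Sketch of a symmetric contrastive alignment loss.

    motion_feats, video_feats: (N, D) arrays where row i of each array
    comes from the same clip. Matched rows are positives (the diagonal
    of the similarity matrix); all other pairs are negatives.
    """
    m = motion_feats / np.linalg.norm(motion_feats, axis=1, keepdims=True)
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    logits = m @ v.T / temperature  # cosine similarities, temperature-scaled

    def diag_cross_entropy(l):
        # log-softmax over each row; positives sit on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # symmetric: motion-to-video and video-to-motion directions
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))
```

Minimizing this pulls each clip's motion and video embeddings together while pushing apart embeddings from different clips, which is one way to realize the semantic consistency the framework calls for.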