HuMoCon: Concept Discovery for Human Motion Understanding

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses two challenges in human motion analysis: (1) the lack of explicit alignment among multimodal features, and (2) the loss of high-frequency motion details, such as joint velocities, under standard masked autoencoding frameworks. It proposes a bimodal motion-video self-supervised learning framework that (1) uses a dual-path motion-video architecture with an explicit cross-modal feature alignment mechanism to enforce semantic consistency, and (2) adds a velocity-guided high-frequency reconstruction loss to preserve dynamic motion details and mitigate temporal over-smoothing. The framework combines multimodal encoders, a cross-modal alignment module, masked motion modeling, and velocity-aware reconstruction objectives. Experiments on standard benchmarks are reported to show significant improvements over state-of-the-art methods, with gains in both pretraining efficiency for large models and the interpretability of the learned motion concepts. The authors state that the code will be publicly released.

📝 Abstract
We present HuMoCon, a novel motion-video understanding framework designed for advanced human behavior analysis. The core of our method is a human motion concept discovery framework that efficiently trains multi-modal encoders to extract semantically meaningful and generalizable features. HuMoCon addresses key challenges in motion concept discovery for understanding and reasoning, including the lack of explicit multi-modality feature alignment and the loss of high-frequency information in masked autoencoding frameworks. Our approach integrates a feature alignment strategy that leverages video for contextual understanding and motion for fine-grained interaction modeling, together with a velocity reconstruction mechanism that enhances high-frequency feature expression and mitigates temporal over-smoothing. Comprehensive experiments on standard benchmarks demonstrate that HuMoCon enables effective motion concept discovery and significantly outperforms state-of-the-art methods in training large models for human motion understanding. We will open-source the associated code with our paper.
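The abstract does not say which objective HuMoCon uses for feature alignment. As an illustration only, the sketch below uses a symmetric contrastive (InfoNCE-style) loss, a common choice for pulling paired embeddings from two modalities together. The function names, the `temperature` value, and the toy embeddings are all assumptions made for this example and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_alignment_loss(motion_feats, video_feats, temperature=0.1):
    """Symmetric InfoNCE-style loss (an assumed stand-in for the paper's alignment).

    Row i of motion_feats and row i of video_feats are embeddings of the same clip.
    """
    m = [l2_normalize(f) for f in motion_feats]
    v = [l2_normalize(f) for f in video_feats]
    n = len(m)
    # logits[i][j]: temperature-scaled cosine similarity of motion i and video j.
    logits = [
        [sum(a * b for a, b in zip(m[i], v[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Each row should assign the highest score to its own index (the true pair).
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    video_to_motion = [list(col) for col in zip(*logits)]
    return 0.5 * (
        cross_entropy_on_diagonal(logits)
        + cross_entropy_on_diagonal(video_to_motion)
    )


# Hypothetical 2-D embeddings for two clips.
motion = [[1.0, 0.0], [0.0, 1.0]]
video_matched = [[0.9, 0.1], [0.1, 0.9]]
video_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = contrastive_alignment_loss(motion, video_matched)
loss_swapped = contrastive_alignment_loss(motion, video_swapped)
```

With these toy inputs, the correctly paired embeddings yield a lower loss than the swapped ones, which is the behavior an explicit alignment term is meant to reward.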
Problem

Research questions and friction points this paper is trying to address.

Lack of explicit multi-modality feature alignment in motion understanding
Loss of high-frequency information in autoencoding frameworks
Need for effective human motion concept discovery and reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal encoders for semantic feature extraction
Feature alignment strategy with video and motion
Velocity reconstruction to enhance high-frequency features
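To make the velocity reconstruction idea concrete, here is a minimal sketch under stated assumptions: velocities are taken as first-order finite differences between consecutive frames, and the loss is a weighted sum of position and velocity mean squared errors. The listing does not give HuMoCon's actual loss, so the function names, `velocity_weight`, and the toy sequences are hypothetical.

```python
def velocities(sequence):
    """Finite-difference velocities: frame[t + 1] - frame[t] for every coordinate."""
    return [
        [nxt - cur for cur, nxt in zip(frame, next_frame)]
        for frame, next_frame in zip(sequence, sequence[1:])
    ]


def mean_squared_error(pred, target):
    """Mean squared error over all frames and coordinates."""
    errors = [
        (p - t) ** 2
        for pred_frame, target_frame in zip(pred, target)
        for p, t in zip(pred_frame, target_frame)
    ]
    return sum(errors) / len(errors)


def velocity_aware_loss(pred, target, velocity_weight=1.0):
    """Position error plus a weighted velocity error (assumed form, not the paper's)."""
    position_term = mean_squared_error(pred, target)
    velocity_term = mean_squared_error(velocities(pred), velocities(target))
    return position_term + velocity_weight * velocity_term


# One joint coordinate that oscillates quickly, and an over-smoothed reconstruction
# that collapses to the temporal mean.
target = [[0.0], [1.0], [0.0], [1.0], [0.0]]
over_smoothed = [[0.5], [0.5], [0.5], [0.5], [0.5]]

position_only = velocity_aware_loss(over_smoothed, target, velocity_weight=0.0)
with_velocity = velocity_aware_loss(over_smoothed, target, velocity_weight=1.0)
```

In this toy case the flat reconstruction has a modest position error but misses every frame-to-frame change, so adding the velocity term raises its loss sharply. That is the intuition behind using velocity reconstruction to counter temporal over-smoothing.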