🤖 AI Summary
This work addresses the challenge of cross-modal alignment between IMU signals and 2D video pose sequences by proposing a hierarchical contrastive learning framework. Instead of aligning IMU data with raw pixels, the method aligns them with skeletal motion sequences and further decomposes full-body motion into local body-part trajectories, enabling fine-grained pairing with corresponding IMU signals. By integrating hierarchical contrastive objectives that jointly model local and global motion dynamics, the approach effectively suppresses visual background distractions while capturing structural relationships across modalities. This enables accurate sub-second temporal synchronization, precise subject- and part-level localization, action recognition, and cross-modal retrieval. Extensive experiments on the mRI, TotalCapture, and EgoHumans datasets demonstrate consistent and significant improvements over existing methods across all four tasks.
📝 Abstract
We aim to learn a joint representation between inertial measurement unit (IMU) signals and 2D pose sequences extracted from video, enabling accurate cross-modal retrieval, temporal synchronization, subject- and body-part localization, and action recognition. To this end, we introduce MoBind, a hierarchical contrastive learning framework designed to address three challenges: (1) filtering out irrelevant visual background, (2) modeling structured multi-sensor IMU configurations, and (3) achieving fine-grained, sub-second temporal alignment. To isolate motion-relevant cues, MoBind aligns IMU signals with skeletal motion sequences rather than raw pixels. We further decompose full-body motion into local body-part trajectories, pairing each with its corresponding IMU sensor to enable semantically grounded multi-sensor alignment. To capture detailed temporal correspondence, MoBind employs a hierarchical contrastive strategy that first aligns token-level temporal segments, then fuses local (body-part) alignment with global (body-wide) motion aggregation. Evaluated on mRI, TotalCapture, and EgoHumans, MoBind consistently outperforms strong baselines across all four tasks, demonstrating robust fine-grained temporal alignment while preserving coarse semantic consistency across modalities. Code is available at https://github.com/bbvisual/MoBind.
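The hierarchical objective described above, local body-part contrastive alignment fused with a global body-wide term, can be sketched as follows. This is an illustrative NumPy sketch of a standard symmetric InfoNCE loss applied at both levels, not the authors' implementation; the function names, the mean-pooling used for the global term, and the weights `w_local` / `w_global` are assumptions for illustration.

```python
import numpy as np

def _diag_cross_entropy(logits):
    """Cross-entropy where the matching (diagonal) entry is the positive."""
    logits = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(logits))
    return -np.mean(log_probs[idx, idx])

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE between two embedding batches of shape (N, D)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)           # L2-normalize
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / tau                                     # (N, N) similarities
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits.T))

def hierarchical_loss(imu_parts, pose_parts, w_local=1.0, w_global=1.0):
    """Hierarchical contrastive loss (illustrative).

    imu_parts, pose_parts: arrays of shape (P, N, D) — P body parts,
    N clips in the batch, D embedding dims. Each part's IMU embedding is
    contrasted with the matching body-part pose embedding (local term);
    mean-pooled part embeddings give the body-wide alignment (global term).
    """
    local = np.mean([info_nce(imu_parts[p], pose_parts[p])
                     for p in range(len(imu_parts))])
    global_ = info_nce(imu_parts.mean(axis=0), pose_parts.mean(axis=0))
    return w_local * local + w_global * global_
```

In this sketch the positives are the IMU/pose embeddings of the same clip (and, for the local term, the same body part), while all other clips in the batch act as negatives; the two terms let the model penalize both part-level and body-wide misalignment.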