MoBind: Motion Binding for Fine-Grained IMU-Video Pose Alignment

📅 2026-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of cross-modal alignment between IMU signals and 2D video pose sequences by proposing a hierarchical contrastive learning framework. Instead of aligning IMU data with raw pixels, the method aligns them with skeletal motion sequences and further decomposes full-body motion into local body-part trajectories, enabling fine-grained pairing with corresponding IMU signals. By integrating hierarchical contrastive objectives that jointly model local and global motion dynamics, the approach effectively suppresses visual background distractions while capturing structural relationships across modalities. This enables accurate sub-second temporal synchronization, precise subject- and part-level localization, action recognition, and cross-modal retrieval. Extensive experiments on the mRI, TotalCapture, and EgoHumans datasets demonstrate consistent and significant improvements over existing methods across all four tasks.

📝 Abstract
We aim to learn a joint representation between inertial measurement unit (IMU) signals and 2D pose sequences extracted from video, enabling accurate cross-modal retrieval, temporal synchronization, subject and body-part localization, and action recognition. To this end, we introduce MoBind, a hierarchical contrastive learning framework designed to address three challenges: (1) filtering out irrelevant visual background, (2) modeling structured multi-sensor IMU configurations, and (3) achieving fine-grained, sub-second temporal alignment. To isolate motion-relevant cues, MoBind aligns IMU signals with skeletal motion sequences rather than raw pixels. We further decompose full-body motion into local body-part trajectories, pairing each with its corresponding IMU to enable semantically grounded multi-sensor alignment. To capture detailed temporal correspondence, MoBind employs a hierarchical contrastive strategy that first aligns token-level temporal segments, then fuses local (body-part) alignment with global (body-wide) motion aggregation. Evaluated on mRI, TotalCapture, and EgoHumans, MoBind consistently outperforms strong baselines across all four tasks, demonstrating robust fine-grained temporal alignment while preserving coarse semantic consistency across modalities. Code is available at https://github.com/bbvisual/MoBind.
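The hierarchical strategy described above can be sketched as a combination of per-body-part and body-wide contrastive terms. The following is a minimal numpy illustration using a symmetric InfoNCE objective between IMU and skeletal-motion embeddings; the function names, the fixed local/global weighting, and the temperature value are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    z_a, z_b: (N, D) arrays of L2-normalized embeddings; rows with the
    same index are treated as positive (matched) pairs, all other rows
    in the batch act as negatives.
    """
    logits = z_a @ z_b.T / temperature            # (N, N) similarity matrix
    labels = np.arange(len(z_a))

    def xent(l):
        # row-wise softmax cross-entropy against the diagonal targets
        l = l - l.max(axis=1, keepdims=True)      # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average both retrieval directions (IMU -> pose and pose -> IMU)
    return 0.5 * (xent(logits) + xent(logits.T))

def hierarchical_loss(imu_parts, pose_parts, imu_global, pose_global,
                      w_local=0.5):
    """Fuse local (per-body-part) and global (body-wide) alignment.

    imu_parts / pose_parts: lists of (N, D) embeddings, one entry per
    body part, each paired with its corresponding IMU sensor.
    imu_global / pose_global: (N, D) aggregated full-body embeddings.
    w_local is a hypothetical mixing weight for illustration.
    """
    local = np.mean([info_nce(a, b) for a, b in zip(imu_parts, pose_parts)])
    return w_local * local + (1.0 - w_local) * info_nce(imu_global, pose_global)
```

In this sketch, correctly matched IMU/pose pairs score a lower loss than mismatched ones, which is the property the retrieval and synchronization tasks rely on.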
Problem

Research questions and friction points this paper is trying to address.

IMU-video alignment
fine-grained temporal alignment
cross-modal retrieval
pose synchronization
multi-sensor fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained alignment
hierarchical contrastive learning
IMU-video fusion
skeletal motion representation
multi-sensor synchronization