🤖 AI Summary
This work addresses the challenge of cross-modal alignment between IMU signals and 2D video pose sequences by proposing a hierarchical contrastive learning framework. Instead of aligning IMU data with raw pixels, the method aligns them with skeletal motion sequences and further decomposes full-body motion into local body-part trajectories, enabling fine-grained pairing with corresponding IMU signals. By integrating hierarchical contrastive objectives that jointly model local and global motion dynamics, the approach effectively suppresses visual background distractions while capturing structural relationships across modalities. This enables accurate sub-second temporal synchronization, precise subject- and part-level localization, action recognition, and cross-modal retrieval. Extensive experiments on the mRI, TotalCapture, and EgoHumans datasets demonstrate consistent and significant improvements over existing methods across all four tasks.
📝 Abstract
We aim to learn a joint representation between inertial measurement unit (IMU) signals and 2D pose sequences extracted from video, enabling accurate cross-modal retrieval, temporal synchronization, subject- and body-part localization, and action recognition. To this end, we introduce MoBind, a hierarchical contrastive learning framework designed to address three challenges: (1) filtering out irrelevant visual background, (2) modeling structured multi-sensor IMU configurations, and (3) achieving fine-grained, sub-second temporal alignment. To isolate motion-relevant cues, MoBind aligns IMU signals with skeletal motion sequences rather than raw pixels. We further decompose full-body motion into local body-part trajectories, pairing each with its corresponding IMU sensor to enable semantically grounded multi-sensor alignment. To capture detailed temporal correspondence, MoBind employs a hierarchical contrastive strategy that first aligns token-level temporal segments, then fuses local (body-part) alignment with global (body-wide) motion aggregation. Evaluated on mRI, TotalCapture, and EgoHumans, MoBind consistently outperforms strong baselines across all four tasks, demonstrating robust fine-grained temporal alignment while preserving coarse semantic consistency across modalities. Code is available at https://github.com/bbvisual/MoBind.
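The hierarchical objective described above, local body-part contrastive alignment fused with a global body-wide term, can be sketched as follows. This is an illustrative NumPy sketch of a standard symmetric InfoNCE loss applied at both levels, not the authors' implementation; the function names, the mean-pooling used for the global term, and the weights `w_local` / `w_global` are assumptions for illustration.

```python
import numpy as np

def _diag_cross_entropy(logits):
    """Cross-entropy where the matching (diagonal) entry is the positive."""
    logits = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(logits))
    return -np.mean(log_probs[idx, idx])

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE between two embedding batches of shape (N, D)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)           # L2-normalize
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / tau                                     # (N, N) similarities
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits.T))

def hierarchical_loss(imu_parts, pose_parts, w_local=1.0, w_global=1.0):
    """Hierarchical contrastive loss (illustrative).

    imu_parts, pose_parts: arrays of shape (P, N, D) — P body parts,
    N clips in the batch, D embedding dims. Each part's IMU embedding is
    contrasted with the matching body-part pose embedding (local term);
    mean-pooled part embeddings give the body-wide alignment (global term).
    """
    local = np.mean([info_nce(imu_parts[p], pose_parts[p])
                     for p in range(len(imu_parts))])
    global_ = info_nce(imu_parts.mean(axis=0), pose_parts.mean(axis=0))
    return w_local * local + w_global * global_
```

In this sketch the positives are the IMU/pose embeddings of the same clip (and, for the local term, the same body part), while all other clips in the batch act as negatives; the two terms let the model penalize both part-level and body-wide misalignment.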