EgoEMG: A Multimodal Egocentric Dataset with Bilateral EMG and Vision for Hand Pose Estimation

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of datasets that synchronize electromyography (EMG) with egocentric (first-person) vision, a gap that has held back research on multimodal hand pose estimation. We present EgoEMG, a dataset of more than 10 hours of synchronized recordings from 41 participants performing 60 gesture classes (30 single-hand and 30 bimanual). The recordings include bilateral wristband surface EMG (16 channels in total, 8 per wrist, sampled at 2 kHz), 120 Hz IMU signals, egocentric wide-angle RGB video, external RGB-D video, and mocap-derived hand poses with wrist articulation angles. On top of the dataset, we define a benchmark with three tasks (EMG-to-pose, vision-to-pose, and EMG+vision fusion) that share a joint-angle prediction target and three generalization splits: cross-gesture, cross-user, and combined. As baselines, we evaluate EMGFormer for EMG-to-pose and generic ResNet/ViT backbones for vision-to-pose, and we study a residual fusion architecture that improves over matched lightweight vision-only baselines. Together, the dataset and benchmark lay a foundation for future research on multimodal hand pose estimation with EMG and vision.
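To make the idea of residual fusion concrete, the sketch below shows one common reading of the term: a vision branch predicts joint angles, and a small head fed with both vision and EMG features predicts a correction that is added on top. This is an illustrative assumption, not the architecture from the paper; the function names, the single linear correction head, and the feature sizes are all hypothetical.

```python
# Illustrative sketch only: the paper's actual fusion architecture is not
# described in this summary. All names and dimensions here are hypothetical.

def linear(weights, bias, x):
    """Apply a dense layer: one output per (weight row, bias) pair."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def residual_fusion(vision_pose, vision_feat, emg_feat, weights, bias):
    """Add an EMG-informed correction to a vision-only joint-angle estimate.

    vision_pose: joint angles predicted by the vision branch alone
    vision_feat, emg_feat: feature vectors from the two branches
    weights, bias: parameters of the (assumed) linear correction head
    """
    fused = list(vision_feat) + list(emg_feat)   # simple concatenation
    correction = linear(weights, bias, fused)    # one correction per joint angle
    return [p + c for p, c in zip(vision_pose, correction)]


# Toy example: 3 joint angles, 2 vision features, 2 EMG features.
vision_pose = [0.10, 0.50, -0.20]
vision_feat = [1.0, 2.0]
emg_feat = [0.5, -1.0]

# With a zero-initialized head the fused output equals the vision estimate.
zero_w = [[0.0] * 4 for _ in range(3)]
zero_b = [0.0] * 3
unchanged = residual_fusion(vision_pose, vision_feat, emg_feat, zero_w, zero_b)

# A head that reads only the first EMG feature adjusts the first joint angle.
emg_w = [[0.0, 0.0, 0.2, 0.0], [0.0] * 4, [0.0] * 4]
adjusted = residual_fusion(vision_pose, vision_feat, emg_feat, emg_w, zero_b)
```

One appeal of this formulation is that the fused model starts from the vision-only prediction and can only depart from it through the learned correction, which makes comparisons against a matched vision-only baseline straightforward.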

📝 Abstract

Surface electromyography (sEMG) records muscle activity during hand movement and can be decoded to recover detailed hand articulation. EMG and egocentric vision are complementary for hand sensing: EMG captures fine-grained finger articulation even under occlusion and poor lighting, while vision provides global hand configuration. However, no existing dataset synchronizes both modalities. We present EgoEMG, a multimodal egocentric dataset for bimanual hand pose estimation. EgoEMG includes bilateral wristband EMG with 16 total channels (8 per wrist) sampled at 2 kHz, 120 Hz IMU, egocentric wide-angle RGB video, external RGB-D video, and mocap-derived hand motion with wrist articulation angles. The dataset covers 41 participants performing 60 gesture classes, including 30 single-hand gestures and 30 bimanual gestures, totaling more than 10 hours of recording. We also introduce a benchmark with three tasks -- EMG-to-pose, vision-to-pose, and EMG+vision fusion -- under a shared joint-angle prediction target and common generalization split axes (cross-gesture, cross-user, and combined). As baselines, we evaluate EMGFormer for EMG-to-pose and generic ResNet/ViT backbones for vision-to-pose. We further study a residual fusion architecture that improves over matched lightweight vision-only baselines. Together, EgoEMG and its benchmark establish a foundation for future research on multimodal hand pose estimation with EMG and vision.
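The three generalization axes named above can be sketched as a simple partition over (participant, gesture) pairs. The code below is a minimal illustration under stated assumptions: the `Recording` type, the `make_split` helper, the held-out counts (8 users, 12 gestures), and the choice to drop partially seen recordings in the combined split are all hypothetical and are not taken from the paper. Only the totals of 41 participants and 60 gesture classes come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recording:
    user: int      # participant index, 0..40 for 41 participants
    gesture: int   # gesture class index, 0..59 for 60 classes


def make_split(recordings, held_out_users, held_out_gestures, axis):
    """Return (train, test) lists for one generalization axis."""
    users, gestures = set(held_out_users), set(held_out_gestures)
    train, test = [], []
    for r in recordings:
        new_user = r.user in users
        new_gesture = r.gesture in gestures
        if axis == "cross_user":
            (test if new_user else train).append(r)
        elif axis == "cross_gesture":
            (test if new_gesture else train).append(r)
        elif axis == "combined":
            if new_user and new_gesture:
                test.append(r)       # unseen user AND unseen gesture
            elif not new_user and not new_gesture:
                train.append(r)      # fully seen on both axes
            # Recordings unseen on only one axis are dropped (an assumption).
        else:
            raise ValueError(f"unknown axis: {axis}")
    return train, test


# Assume one recording per (participant, gesture) pair for illustration.
recordings = [Recording(u, g) for u in range(41) for g in range(60)]
held_users = range(33, 41)       # hypothetical: last 8 participants held out
held_gestures = range(48, 60)    # hypothetical: last 12 gestures held out

splits = {
    axis: make_split(recordings, held_users, held_gestures, axis)
    for axis in ("cross_user", "cross_gesture", "combined")
}
```

Under these assumed hold-out sizes, the combined split trains on 33 × 48 pairs and tests on 8 × 12 pairs, so every test recording is new along both axes at once.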

Problem

Research questions and friction points this paper is trying to address.

hand pose estimation

surface electromyography

egocentric vision

multimodal dataset

bimanual gestures

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal hand pose estimation

surface electromyography (sEMG)

egocentric vision