AI Summary
Event-driven hand pose estimation offers high temporal resolution and low latency, but the scarcity of labeled event data prevents these advantages from being fully exploited. To address this, we propose a cross-modal self-supervised pre-training framework that leverages a small set of labeled RGB images together with a large collection of unpaired event streams. Our method decomposes hand motion into progressive steps and enforces a motion-reversal constraint to synthesize high-fidelity pseudo-event data, thereby relaxing the restrictive static-hand assumption of prior generators. We further introduce a temporal alignment mechanism and self-supervised regularization to transfer pose annotations from the RGB domain to the event domain. Evaluated on EvRealHands, our approach outperforms prior state-of-the-art methods by up to 24% and achieves superior accuracy after fine-tuning on only a small number of annotated samples, significantly improving practical deployability.
Abstract
This paper presents RPEP, the first pre-training method for event-based 3D hand pose estimation that uses labeled RGB images and unpaired, unlabeled event data. Event data offer significant benefits such as high temporal resolution and low latency, but their application to hand pose estimation is still limited by the scarcity of labeled training data. To address this, we repurpose real RGB datasets to train event-based estimators by constructing pseudo-event-RGB pairs, in which event data are generated and aligned with the ground-truth poses of the RGB images. Unfortunately, existing pseudo-event generation techniques assume stationary objects and thus struggle to handle dynamically moving hands. To overcome this, RPEP introduces a novel generation strategy that decomposes hand movement into smaller, step-by-step motions. This decomposition captures temporal changes in articulation, producing more realistic event data for a moving hand. Additionally, RPEP imposes a motion reversal constraint that regularizes event generation using the reversed motion. Extensive experiments show that our pre-trained model significantly outperforms state-of-the-art methods on real event data, achieving up to a 24% improvement on EvRealHands. Moreover, it delivers strong performance with minimal labeled samples for fine-tuning, making it well-suited for practical deployment.
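To make the two core ideas concrete, the sketch below illustrates (a) decomposing the motion between two frames into interpolated sub-steps and emitting threshold-based pseudo-events per step, and (b) a motion-reversal consistency term that compares forward and reversed event sequences. This is a minimal illustration, not RPEP's actual implementation: the function names, the linear-interpolation stand-in for articulated hand motion, and the simple log-intensity contrast-threshold event model are all assumptions for illustration.

```python
import numpy as np

def generate_events(frame_a, frame_b, threshold=0.2):
    """Emit a per-pixel pseudo-event polarity map (+1/-1/0) where the
    log-intensity change between two frames exceeds a contrast threshold.
    (Simplified contrast-threshold event model; not the paper's generator.)"""
    diff = np.log1p(frame_b) - np.log1p(frame_a)
    polarity = np.sign(diff) * (np.abs(diff) >= threshold)
    return polarity.astype(np.int8)

def interpolate_motion(start, end, n_steps=4):
    """Decompose the motion from `start` to `end` into n_steps sub-frames.
    (Linear pixel interpolation stands in for articulated hand motion.)"""
    return [start + (end - start) * t / n_steps for t in range(n_steps + 1)]

def decomposed_pseudo_events(frames, threshold=0.2):
    """Accumulate pseudo-events step by step over the sub-frame sequence."""
    return [generate_events(frames[i], frames[i + 1], threshold)
            for i in range(len(frames) - 1)]

def reversal_penalty(frames, threshold=0.2):
    """Motion-reversal constraint (illustrative): events of the reversed
    motion should mirror the forward events with flipped polarity, so
    forward step i plus backward step (n-1-i) should cancel."""
    fwd = decomposed_pseudo_events(frames, threshold)
    bwd = decomposed_pseudo_events(frames[::-1], threshold)
    return float(np.mean([np.abs(f + b).mean()
                          for f, b in zip(fwd, bwd[::-1])]))
```

Under this toy event model the penalty is exactly zero, since reversing the sub-frame sequence negates each log-intensity difference; in a learned generator the analogous term would be a non-trivial regularizer on the synthesized events.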