EvHand-FPV: Efficient Event-Based 3D Hand Tracking from First-Person View

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
First-person 3D hand pose estimation for resource-constrained XR devices remains challenging due to high latency and power consumption. To address this, we propose an event-camera-based lightweight and efficient framework that avoids explicit 3D reconstruction. Our method introduces a geometric prior—derived from wrist region geometry—to guide region-of-interest (ROI) localization, coupled with an end-to-end offset embedding and multi-task learning architecture. An auxiliary geometric feature head and a highly streamlined network are incorporated, jointly trained on synthetic and real-world event data to achieve accuracy–efficiency co-optimization. Experiments show state-of-the-art performance: 0.85 2D-AUCp (+0.08) on real test data, 89% parameter reduction, and only 0.185G FLOPs per inference; 3D-AUCp remains 0.84 on synthetic data. The core contribution is a geometrically guided, low-overhead end-to-end paradigm for hand pose estimation tailored for edge XR systems.

📝 Abstract
Hand tracking holds great promise for intuitive interaction paradigms, but frame-based methods often struggle to meet the requirements of accuracy, low latency, and energy efficiency, especially in resource-constrained settings such as Extended Reality (XR) devices. Event cameras provide µs-level temporal resolution at mW-level power by asynchronously sensing brightness changes. In this work, we present EvHand-FPV, a lightweight framework for egocentric First-Person-View 3D hand tracking from a single event camera. We construct an event-based FPV dataset that couples synthetic training data with 3D labels and real event data with 2D labels for evaluation to address the scarcity of egocentric benchmarks. EvHand-FPV also introduces a wrist-based region of interest (ROI) that localizes the hand region via geometric cues, combined with an end-to-end mapping strategy that embeds ROI offsets into the network to reduce computation without explicit reconstruction, and a multi-task learning strategy with an auxiliary geometric feature head that improves representations without test-time overhead. On our real FPV test set, EvHand-FPV improves 2D-AUCp from 0.77 to 0.85 while reducing parameters from 11.2M to 1.2M (an 89% reduction) and FLOPs per inference from 1.648G to 0.185G (also 89%). It also maintains a competitive 3D-AUCp of 0.84 on synthetic data. These results demonstrate accurate and efficient egocentric event-based hand tracking suitable for on-device XR applications. The dataset and code are available at https://github.com/zen5x5/EvHand-FPV.
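The abstract's starting point is that an event camera emits an asynchronous stream of per-pixel brightness-change events rather than frames. A common way to feed such a stream to a standard network is to accumulate events into a polarity-count image; the sketch below illustrates that generic representation (the paper's exact input encoding may differ, and the function name is our own).

```python
import numpy as np

def events_to_count_image(events, height, width):
    """Accumulate an asynchronous event stream into a 2-channel
    polarity count image. events: array of shape (N, 4) with columns
    (x, y, timestamp, polarity), polarity in {0, 1}."""
    img = np.zeros((2, height, width), dtype=np.float32)
    for x, y, _, p in events:
        # Count events per pixel, separated by polarity channel.
        img[int(p), int(y), int(x)] += 1.0
    return img

# Tiny synthetic example: three events, two at the same pixel/polarity.
ev = np.array([
    [3, 2, 0.001, 1],
    [3, 2, 0.002, 1],
    [5, 4, 0.003, 0],
])
frame = events_to_count_image(ev, height=8, width=8)
print(frame[1, 2, 3], frame[0, 4, 5])  # 2.0 1.0
```

The resulting dense tensor can then be processed by an ordinary lightweight CNN, which is what makes the mW-level event sensor compatible with conventional pose-estimation backbones.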
Problem

Research questions and friction points this paper is trying to address.

Efficient 3D hand tracking from first-person view
Overcoming accuracy and latency limitations in XR devices
Reducing computational requirements for event-based vision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Wrist-based ROI localization via geometric cues
End-to-end mapping with embedded ROI offsets
Multi-task learning with auxiliary geometric feature head
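The first two innovations above can be sketched together: crop a square ROI around a wrist estimate, and carry the crop's normalized offset along with it so the network can map ROI-local predictions back to full-image coordinates without explicit 3D reconstruction. This is an illustrative sketch of the idea only; the function, ROI size, and sensor resolution are assumptions, not the paper's implementation.

```python
import numpy as np

def crop_wrist_roi(frame, wrist_xy, roi_size):
    """Crop a square ROI centered on a wrist estimate (clipped to the
    image bounds) and return the crop plus its normalized top-left
    offset, which can be embedded as an extra network input."""
    c, h, w = frame.shape
    half = roi_size // 2
    # Clip so the ROI always lies fully inside the frame.
    x0 = int(np.clip(wrist_xy[0] - half, 0, w - roi_size))
    y0 = int(np.clip(wrist_xy[1] - half, 0, h - roi_size))
    roi = frame[:, y0:y0 + roi_size, x0:x0 + roi_size]
    offset = np.array([x0 / w, y0 / h], dtype=np.float32)
    return roi, offset

# Assumed 346x260 sensor resolution and a 128-px ROI, for illustration.
frame = np.zeros((2, 260, 346), dtype=np.float32)
roi, offset = crop_wrist_roi(frame, wrist_xy=(300, 200), roi_size=128)
print(roi.shape, offset)
```

Because only the small ROI is processed and the offset is folded into the model's input rather than handled by a separate localization stage, the downstream network can stay small, which is consistent with the parameter and FLOP reductions reported above.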
Zhen Xu
Leiden Institute of Advanced Computer Science (LIACS), Leiden University, The Netherlands
Guorui Lu
Leiden Institute of Advanced Computer Science (LIACS), Leiden University, The Netherlands
Chang Gao
Department of Microelectronics, Delft University of Technology, The Netherlands
Qinyu Chen
Assistant Professor, Leiden University
Edge AI · IC design · Neuromorphic Computing · Event-based vision · AR/VR