SFHand: A Streaming Framework for Language-guided 3D Hand Forecasting and Embodied Manipulation

📅 2025-11-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods struggle to simultaneously achieve real-time performance and effective language guidance, limiting their applicability to natural human–AR interaction and embodied robotic manipulation. This paper introduces the first language-guided streaming 3D hand prediction framework, enabling autoregressive sequential prediction of hand type, 2D bounding boxes, 3D hand poses, and trajectories. Our core contributions are: (1) a streaming autoregressive architecture integrated with an ROI-enhanced memory layer, which jointly models video streams and linguistic instructions to efficiently capture temporal context and attend to salient hand regions; and (2) EgoHaFL, the first large-scale dataset with synchronized 3D hand poses and language instructions. Experiments demonstrate that our method surpasses state-of-the-art approaches by 35.8% in 3D hand pose prediction accuracy and, when transferred to embodied manipulation tasks, achieves up to a 13.4% improvement in task success rate.

📝 Abstract
Real-time 3D hand forecasting is a critical component for fluid human-computer interaction in applications like AR and assistive robotics. However, existing methods are ill-suited for these scenarios, as they typically require offline access to accumulated video sequences and cannot incorporate language guidance that conveys task intent. To overcome these limitations, we introduce SFHand, the first streaming framework for language-guided 3D hand forecasting. SFHand autoregressively predicts a comprehensive set of future 3D hand states, including hand type, 2D bounding box, 3D pose, and trajectory, from a continuous stream of video and language instructions. Our framework combines a streaming autoregressive architecture with an ROI-enhanced memory layer, capturing temporal context while focusing on salient hand-centric regions. To enable this research, we also introduce EgoHaFL, the first large-scale dataset featuring synchronized 3D hand poses and language instructions. We demonstrate that SFHand achieves new state-of-the-art results in 3D hand forecasting, outperforming prior work by a significant margin of up to 35.8%. Furthermore, we show the practical utility of our learned representations by transferring them to downstream embodied manipulation tasks, improving task success rates by up to 13.4% on multiple benchmarks. Dataset page: https://huggingface.co/datasets/ut-vision/EgoHaFL, project page: https://github.com/ut-vision/SFHand.
Problem

Research questions and friction points this paper is trying to address.

Real-time 3D hand forecasting is needed for fluid human-computer interaction in AR and robotics
Existing methods require offline access to accumulated video sequences and cannot incorporate language guidance
No existing dataset pairs synchronized 3D hand poses with language instructions for training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming framework for real-time 3D hand forecasting
Autoregressive prediction from video and language streams
ROI-enhanced memory layer capturing temporal hand context
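The ROI-enhanced memory idea can be illustrated with a minimal sketch: a bounded buffer of per-frame features where each frame is re-weighted by a hand-region (ROI) mask before storage, so pooled context emphasizes salient hand-centric regions while keeping streaming latency constant. This is a hypothetical simplification for intuition only; the `ROIMemory` class, its weighting scheme, and mean pooling are illustrative assumptions, not the actual SFHand architecture.

```python
# Hypothetical sketch of an ROI-weighted streaming memory (not the actual
# SFHand implementation): bounded frame memory + ROI re-weighting + pooling.
from collections import deque

class ROIMemory:
    def __init__(self, capacity=8):
        # Bounded memory: old frames are evicted automatically, so per-step
        # cost stays constant for streaming autoregressive prediction.
        self.memory = deque(maxlen=capacity)

    def update(self, frame_features, roi_mask):
        # Emphasize features inside the hand ROI; roi_mask values in [0, 1].
        # Features outside the ROI are attenuated rather than discarded.
        weighted = [f * (0.5 + 0.5 * m) for f, m in zip(frame_features, roi_mask)]
        self.memory.append(weighted)

    def context(self):
        # Mean-pool over stored frames to form a temporal context vector.
        n = len(self.memory)
        dim = len(self.memory[0])
        return [sum(frame[i] for frame in self.memory) / n for i in range(dim)]

mem = ROIMemory(capacity=2)
mem.update([1.0, 2.0], [1.0, 0.0])   # ROI covers the first feature
mem.update([3.0, 4.0], [0.0, 1.0])   # ROI covers the second feature
ctx = mem.context()                  # pooled, ROI-emphasized context
```

In the real framework this context would condition an autoregressive decoder that emits the next hand type, 2D bounding box, 3D pose, and trajectory step; the fixed capacity is what makes the memory suitable for continuous video streams.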