🤖 AI Summary
Existing methods struggle to simultaneously achieve real-time performance and effective language guidance, limiting their applicability in natural human–AR and embodied robotic interaction. This paper introduces the first language-guided streaming 3D hand prediction framework, enabling autoregressive sequential prediction of hand type, 2D bounding boxes, 3D hand poses, and trajectories. Our core contributions are: (1) a streaming autoregressive architecture integrated with an ROI-enhanced memory layer, jointly modeling video streams and linguistic instructions to efficiently capture temporal context and attend to salient hand regions; and (2) EgoHaFL—the first large-scale, synchronized 3D hand pose–language instruction dataset. Experiments demonstrate that our method surpasses state-of-the-art approaches by 35.8% in 3D hand pose prediction accuracy and, when transferred to embodied manipulation tasks, achieves up to a 13.4% improvement in task success rate.
📝 Abstract
Real-time 3D hand forecasting is a critical component for fluid human-computer interaction in applications like AR and assistive robotics. However, existing methods are ill-suited for these scenarios, as they typically require offline access to accumulated video sequences and cannot incorporate language guidance that conveys task intent. To overcome these limitations, we introduce SFHand, the first streaming framework for language-guided 3D hand forecasting. SFHand autoregressively predicts a comprehensive set of future 3D hand states, including hand type, 2D bounding box, 3D pose, and trajectory, from a continuous stream of video and language instructions. Our framework combines a streaming autoregressive architecture with an ROI-enhanced memory layer, capturing temporal context while focusing on salient hand-centric regions. To enable this research, we also introduce EgoHaFL, the first large-scale dataset featuring synchronized 3D hand poses and language instructions. We demonstrate that SFHand achieves new state-of-the-art results in 3D hand forecasting, outperforming prior work by a significant margin of up to 35.8%. Furthermore, we show the practical utility of our learned representations by transferring them to downstream embodied manipulation tasks, improving task success rates by up to 13.4% on multiple benchmarks. Dataset page: https://huggingface.co/datasets/ut-vision/EgoHaFL, project page: https://github.com/ut-vision/SFHand.