🤖 AI Summary
Existing pointing gesture recognition methods often rely on depth cameras, are confined to indoor environments, and support only discrete selection among a limited set of targets. To address these limitations, this paper proposes a framework built around PointingNet, a model that interprets pointing gestures from a single RGB camera in both indoor and outdoor scenes. PointingNet uses a segmentation model to mask any lifted arm, detects whether a pointing gesture is occurring, and then estimates the position and direction of the index finger. Its mean angular error is below 2°, compared with about 28° for state-of-the-art human pose estimation models. From the estimated pointing information, the framework computes the target location and then plans and executes the robot's motion toward it. Evaluated on two robotic systems, the framework reaches the indicated targets accurately, which makes pointing-based human-robot interaction practical without a depth sensor.
📝 Abstract
In communication between humans, gestures are often preferred over, or used to complement, verbal expression because they offer better spatial reference. A finger-pointing gesture conveys vital information about a point of interest in the environment. In human-robot interaction, pointing lets a user easily direct a robot to a target location, for example in search and rescue or factory assistance. State-of-the-art approaches for visual pointing estimation often rely on depth cameras, are limited to indoor environments, and provide discrete predictions among a limited set of targets. In this paper, we explore the learning of models that enable robots to understand pointing directives in various indoor and outdoor environments based solely on a single RGB camera. We propose a novel framework that includes a dedicated model termed PointingNet. PointingNet recognizes the occurrence of pointing and then approximates the position and direction of the index finger. The model relies on a novel segmentation model for masking any lifted arm. While state-of-the-art human pose estimation models yield a poor pointing angle error of 28°, PointingNet exhibits a mean error of less than 2°. Given the pointing information, the target is computed, followed by planning and motion of the robot. The framework is evaluated on two robotic systems and yields accurate target reaching.
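The abstract does not say how the target is computed from the estimated fingertip position and pointing direction. One common geometric reading, offered here only as an illustrative sketch, is to intersect the pointing ray with the ground plane. The function names, the z-up coordinate frame, and the flat-ground assumption below are hypothetical and not taken from the paper. The second helper shows how an angular error such as the reported 2° and 28° figures is typically measured between a predicted and a reference direction.

```python
import math


def pointing_target_on_ground(origin, direction, ground_z=0.0):
    """Intersect a pointing ray with the horizontal plane z = ground_z.

    Assumption (not from the paper): coordinates are in a frame whose
    z axis points up and the ground is flat.
    origin    -- (x, y, z) of the fingertip
    direction -- (dx, dy, dz) pointing direction, any non-zero length
    Returns the (x, y, z) hit point, or None if the ray never reaches
    the plane (parallel to it or pointing away from it).
    """
    ox, oy, oz = origin
    dx, dy, dz = direction
    if abs(dz) < 1e-9:
        return None  # ray is parallel to the ground plane
    t = (ground_z - oz) / dz  # ray parameter at the plane
    if t <= 0:
        return None  # plane lies behind the fingertip
    return (ox + t * dx, oy + t * dy, ground_z)


def angular_error_deg(u, v):
    """Angle in degrees between two non-zero 3D direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("direction vectors must be non-zero")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_angle))


if __name__ == "__main__":
    # Fingertip 1.5 m above the ground, pointing forward and down at 45 degrees.
    print(pointing_target_on_ground((0.0, 0.0, 1.5), (1.0, 0.0, -1.0)))
    print(angular_error_deg((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)))
```

Under these assumptions, the distance to the target grows quickly as the pointing direction approaches horizontal, so a small angular error can translate into a large position error for far targets. This is one reason a low mean angular error matters for target reaching.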