🤖 AI Summary
This study addresses the challenge of accurately identifying the object indicated by a human pointing gesture using only a single RGB image. To this end, the authors propose a modular pipeline that integrates object detection, human pose estimation, monocular depth estimation, and a vision-language model. By reconstructing 3D spatial relationships and leveraging image captioning to correct classification errors, the approach resolves ambiguities in pointing direction. According to the authors, this is the first systematic evaluation of the synergy between 3D spatial information reconstructed from a single image and vision-language models for pointing-target recognition. Experiments on a newly curated dataset demonstrate that incorporating depth cues significantly improves accuracy in scenes with heavy occlusion, without requiring specialized depth sensors, which makes the system flexible to deploy.
📝 Abstract
This paper presents a comprehensive pipeline for recognizing objects targeted by human pointing gestures using RGB images. As human-robot interaction moves toward more intuitive interfaces, the ability to identify targets of non-verbal communication becomes crucial. Our proposed system integrates several existing state-of-the-art methods, including object detection, body pose estimation, monocular depth estimation, and vision-language models. We evaluate the impact of 3D spatial information reconstructed from a single image and the utility of image captioning models in correcting classification errors. Experimental results on a custom dataset show that incorporating depth information significantly improves target identification, especially in complex scenes with overlapping objects. The modularity of the approach allows for deployment in environments where specialized depth sensors are unavailable.
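The abstract does not spell out how the reconstructed 3D information is used to pick the target. A common geometric core for such pipelines, sketched below under stated assumptions, is to cast a 3D pointing ray through two arm keypoints (their pixel positions lifted to 3D with the monocular depth map) and select the detected object whose 3D center lies closest to that ray. All names, the elbow-to-wrist ray choice, and the pre-lifted 3D coordinates are illustrative assumptions, not the authors' actual method.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical detection record: in a full pipeline the 2D box center
    would be lifted to camera coordinates via the depth map; here the 3D
    centers are given directly for illustration."""
    label: str
    center: tuple  # (x, y, z) in camera coordinates


def _sub(a, b):
    return tuple(ai - bi for ai, bi in zip(a, b))


def _dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def _norm(a):
    return math.sqrt(_dot(a, a))


def ray_point_distance(origin, direction, point):
    """Distance from `point` to the ray origin + t * direction, t >= 0."""
    v = _sub(point, origin)
    t = max(0.0, _dot(v, direction))  # clamp: only points ahead of the hand count
    closest = tuple(o + t * d for o, d in zip(origin, direction))
    return _norm(_sub(point, closest))


def pointed_object(elbow, wrist, detections):
    """Pick the detection whose 3D center lies closest to the pointing ray.

    The ray runs from the elbow through the wrist keypoint (one common
    choice; eye-to-fingertip is another)."""
    d = _sub(wrist, elbow)
    n = _norm(d)
    direction = tuple(di / n for di in d)
    return min(detections,
               key=lambda det: ray_point_distance(wrist, direction, det.center))
```

For example, with the wrist at the origin pointing along +z, an object centered on the z-axis would be preferred over one offset to the side. Depth is what makes this disambiguation work: two objects that overlap in the 2D image can sit at very different distances along the ray in 3D, which matches the paper's finding that depth cues help most in occluded scenes.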