Pointing-Based Object Recognition

📅 2026-03-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the challenge of accurately identifying the target object indicated by a human pointing gesture using only a single RGB image. To this end, the authors propose a modular pipeline that integrates object detection, human pose estimation, monocular depth estimation, and a vision-language model. By reconstructing 3D spatial relationships and leveraging image captioning to correct classification errors, the approach effectively resolves ambiguities in pointing direction. This work presents the first systematic evaluation of the synergistic role between 3D spatial information reconstructed from a single image and vision-language models for pointing target recognition. Experiments on a newly curated dataset demonstrate that incorporating depth cues significantly improves accuracy in complex occlusion scenarios, without requiring specialized depth sensors, thereby offering strong deployment flexibility.
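The core geometric step the summary describes — casting a 3D pointing ray and picking the object it indicates — can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the pose estimator yields 3D elbow and wrist keypoints (lifted via the monocular depth map) and that each detected object has a 3D centroid; objects are scored by perpendicular distance to the elbow-to-wrist ray.

```python
import numpy as np

def select_pointed_object(elbow: np.ndarray, wrist: np.ndarray,
                          object_centroids: list[np.ndarray]) -> int:
    """Return the index of the centroid closest to the pointing ray.

    The ray starts at the wrist and extends along the elbow->wrist
    direction; candidates behind the hand are skipped, and the rest
    are ranked by perpendicular distance to the ray.
    """
    direction = wrist - elbow
    direction = direction / np.linalg.norm(direction)

    best_idx, best_dist = -1, np.inf
    for i, c in enumerate(object_centroids):
        v = c - wrist
        t = float(np.dot(v, direction))   # projection onto the ray
        if t < 0:                         # object lies behind the hand
            continue
        perp = np.linalg.norm(v - t * direction)
        if perp < best_dist:
            best_idx, best_dist = i, perp
    return best_idx
```

With depth cues available, this distance is measured in 3D rather than in the image plane, which is what lets the pipeline disambiguate overlapping objects at different depths.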

📝 Abstract
This paper presents a comprehensive pipeline for recognizing objects targeted by human pointing gestures using RGB images. As human-robot interaction moves toward more intuitive interfaces, the ability to identify targets of non-verbal communication becomes crucial. Our proposed system integrates several existing state-of-the-art methods, including object detection, body pose estimation, monocular depth estimation, and vision-language models. We evaluate the impact of 3D spatial information reconstructed from a single image and the utility of image captioning models in correcting classification errors. Experimental results on a custom dataset show that incorporating depth information significantly improves target identification, especially in complex scenes with overlapping objects. The modularity of the approach allows for deployment in environments where specialized depth sensors are unavailable.
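The "3D spatial information reconstructed from a single image" mentioned above amounts to lifting 2D detections into camera space using a monocular depth estimate. A minimal sketch, assuming a standard pinhole model with known (or estimated) intrinsics `fx, fy, cx, cy` — these parameter names are illustrative, not taken from the paper:

```python
import numpy as np

def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Lift pixel (u, v) with metric depth to a 3D point in camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])
```

Applying this to pose keypoints and object-box centers yields the 3D points over which pointing-ray reasoning can operate, without any dedicated depth sensor.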
Problem

Research questions and friction points this paper is trying to address.

pointing gesture
object recognition
RGB image
target identification
human-robot interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

pointing-based recognition
monocular depth estimation
vision-language models
3D spatial reasoning
modular pipeline
🔎 Similar Papers
2024-08-20 · 2024 2nd International Conference on Computer, Vision and Intelligent Technology (ICCVIT) · Citations: 2
Lukáš Hajdúch
Comenius University, Bratislava, Slovakia
Viktor Kocur
Assistant Professor, Comenius University
computer vision · 3D vision · deep learning