Pointing-Guided Target Estimation via Transformer-Based Attention

📅 2025-09-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the critical human-robot interaction (HRI) problem of interpreting human pointing gestures toward objects on a desktop surface. We propose a multimodal interaction Transformer model that operates solely on monocular RGB images. Leveraging cross-modal attention, the model maps 2D pointing trajectories onto a 3D desktop object space and jointly fuses visual and geometric features to probabilistically localize target objects. A novel contribution is the introduction of a local-region confusion matrix, which quantifies spatial consistency of predictions within neighborhood regions, thereby enhancing robustness against trajectory noise and occlusion. Experiments on the NICOL robotic platform demonstrate that our method significantly outperforms existing baselines, achieving high-accuracy (>92% Top-1 accuracy) and low-latency target prediction in real-world interactive settings. The source code is publicly available.

Technology Category

Application Category

📝 Abstract
Deictic gestures, like pointing, are a fundamental form of non-verbal communication, enabling humans to direct attention to specific objects or locations. This capability is essential in Human-Robot Interaction (HRI), where robots should be able to predict human intent and anticipate appropriate responses. In this work, we propose the Multi-Modality Inter-TransFormer (MM-ITF), a modular architecture to predict objects in a controlled tabletop scenario with the NICOL robot, where humans indicate targets through natural pointing gestures. Leveraging inter-modality attention, MM-ITF maps 2D pointing gestures to object locations, assigns a likelihood score to each, and identifies the most likely target. Our results demonstrate that the method can accurately predict the intended object using monocular RGB data, thus enabling intuitive and accessible human-robot collaboration. To evaluate the performance, we introduce a patch confusion matrix, providing insights into the model's predictions across candidate object locations. Code available at: https://github.com/lucamuellercode/MMITF.
Problem

Research questions and friction points this paper is trying to address.

Predicting human intent from pointing gestures in HRI
Mapping 2D pointing gestures to object locations
Enabling intuitive human-robot collaboration via monocular RGB
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based attention for pointing gesture interpretation
Multi-modality inter-transformer architecture for target prediction
Monocular RGB data processing for object localization
🔎 Similar Papers
No similar papers found.
L
Luca Müller
University of Hamburg, Department of Informatics, Knowledge Technology Research Group, Hamburg, Germany
H
Hassan Ali
University of Hamburg, Department of Informatics, Knowledge Technology Research Group, Hamburg, Germany
Philipp Allgeuer
Philipp Allgeuer
University of Hamburg
Humanoid RoboticsDeep Learning
Lukáš Gajdošech
Lukáš Gajdošech
PhD student at Comenius University, Bratislava
Computer VisionPoint Cloud ProcessingMachine Learning
S
Stefan Wermter
University of Hamburg, Department of Informatics, Knowledge Technology Research Group, Hamburg, Germany