HiSync: Spatio-Temporally Aligning Hand Motion from Wearable IMU and On-Robot Camera for Command Source Identification in Long-Range HRI

📅 2026-03-12
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of identifying command sources in long-range multi-user human-robot interaction, where sensor ambiguity often degrades performance. To resolve this, the authors propose a novel approach that fuses optical flow from the robot's camera with hand motion signals captured by wearable inertial measurement units (IMUs). By leveraging frequency-domain feature extraction, spatiotemporal alignment, and a distance-aware multi-window cross-modal fusion mechanism, the method uniquely exploits hand motion as a user-binding cue. A learned CSINet denoising network and cross-modal similarity computation further enhance robustness under long-range conditions. Evaluated in a real-world setting with three users at distances up to 34 meters, the system achieves a command-source identification accuracy of 92.32%, outperforming the state of the art by 48.44%, and demonstrates practical efficacy on a physical robotic platform.
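The pipeline the summary describes - spectral hand-motion features from camera optical flow and wrist IMUs, temporal alignment, then per-user similarity scoring - can be illustrated compactly. The sketch below is not the authors' CSINet: it is a minimal NumPy/SciPy rendering of the matching idea, and the function names, FFT length, and cosine-similarity scoring are all assumptions for illustration.

```python
import numpy as np
from scipy.signal import correlate


def spectral_feature(sig, n_fft=256):
    """L2-normalized magnitude spectrum of a mean-removed 1-D motion signal."""
    sig = np.asarray(sig, dtype=float)
    spec = np.abs(np.fft.rfft(sig - sig.mean(), n=n_fft))
    return spec / (np.linalg.norm(spec) + 1e-8)


def temporal_align(flow_mag, imu_mag):
    """Circularly shift the IMU stream by the lag that maximizes its
    cross-correlation with the optical-flow magnitude (a simplification
    of a proper, non-circular alignment)."""
    a = flow_mag - flow_mag.mean()
    b = imu_mag - imu_mag.mean()
    lag = int(np.argmax(correlate(a, b, mode="full"))) - (len(b) - 1)
    return np.roll(imu_mag, lag)


def identify_source(flow_mag, imu_streams):
    """Pick the user whose wrist IMU best matches the camera's motion cue.

    flow_mag    : 1-D optical-flow magnitude sampled near the gesturing hand
    imu_streams : one 1-D acceleration-magnitude array per candidate user
    Returns (index of the best-matching user, list of per-user scores).
    """
    flow_mag = np.asarray(flow_mag, dtype=float)
    cam_feat = spectral_feature(flow_mag)
    scores = []
    for imu in imu_streams:
        aligned = temporal_align(flow_mag, np.asarray(imu, dtype=float))
        scores.append(float(cam_feat @ spectral_feature(aligned)))
    return int(np.argmax(scores)), scores
```

A real system would additionally need per-user hand localization in the image and a rejection threshold for when no IMU matches (e.g., a gesturing bystander); the paper's denoising and distance-aware fusion target exactly those failure modes.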

📝 Abstract
Long-range Human-Robot Interaction (HRI) remains underexplored. Within it, Command Source Identification (CSI) - determining who issued a command - is especially challenging due to multi-user and distance-induced sensor ambiguity. We introduce HiSync, an optical-inertial fusion framework that treats hand motion as a binding cue by aligning robot-mounted camera optical flow with hand-worn IMU signals. We first elicit a user-defined (N=12) gesture set and collect a multimodal command gesture dataset (N=38) in long-range multi-user HRI scenarios. Next, HiSync extracts frequency-domain hand motion features from both camera and IMU data, and a learned CSINet denoises IMU readings, temporally aligns the modalities, and performs distance-aware multi-window fusion to compute cross-modal similarity of subtle, natural gestures, enabling robust CSI. In three-person scenes at up to 34 m, HiSync achieves 92.32% CSI accuracy, outperforming the prior SOTA by 48.44%. HiSync is also validated in a real-robot deployment. By making CSI reliable and natural, HiSync provides a practical primitive and design guidance for public-space HRI.
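Of the steps above, the distance-aware multi-window fusion is the easiest to make concrete: score each candidate over several window lengths, then blend the scores with distance-dependent weights. The abstract does not give the weighting, so the scheme below - longer windows gaining weight as the user approaches the 34 m maximum range - is a hypothetical sketch, with window sizes chosen arbitrarily.

```python
import numpy as np


def multi_window_scores(sim_fn, cam, imu, windows=(64, 128, 256)):
    """Similarity over the trailing w samples of both streams, per window."""
    scores = []
    for w in windows:
        n = min(w, len(cam), len(imu))
        scores.append(sim_fn(cam[-n:], imu[-n:]))
    return np.array(scores)


def distance_aware_fuse(scores, distance_m, windows=(64, 128, 256),
                        max_range_m=34.0):
    """Blend per-window scores; farther users lean on longer windows."""
    r = np.clip(distance_m / max_range_m, 0.0, 1.0)
    # r = 0 gives uniform weights; r = 1 weights each window by its length.
    raw = np.array(windows, dtype=float) ** r
    weights = raw / raw.sum()
    return float(weights @ scores)
```

At close range the weights stay near uniform; at 34 m they shift toward the 256-sample window, matching the intuition that per-pixel optical flow grows noisier with distance, so longer evidence windows are needed there.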
Problem

Research questions and friction points this paper is trying to address.

Command Source Identification
Long-range Human-Robot Interaction
Multi-user Ambiguity
Sensor Fusion
Gesture Recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optical-Inertial Fusion
Command Source Identification
Hand Motion Alignment
Long-Range HRI
Multimodal Gesture Recognition
🔎 Similar Papers
No similar papers found.
Chengwen Zhang
Department of Computer Science and Technology, BNRist, Tsinghua University
Chun Yu
Department of Computer Science and Technology, BNRist, College of AI, Tsinghua University
Borong Zhuang
Department of Computer Science and Technology, Tsinghua University
Haopeng Jin
Beijing University of Posts and Telecommunications
Qingyang Wan
Academy of Arts & Design, Tsinghua University
Zhuojun Li
Tsinghua University
Human Computer Interaction
Zhe He
University of Macau
deep learning, reinforcement learning, POMDPs
Zhoutong Ye
Department of Computer Science and Technology, Tsinghua University
Yu Mei
Michigan State University
Soft Robotics, Control
Chang Liu
Tsinghua University
HCI
Weinan Shi
Tsinghua University
HCI
Yuanchun Shi
Professor
human computer interaction