🤖 AI Summary
Traditional screen-based interfaces lack effective depth cues, limiting the intuitiveness and efficiency of robotic teleoperation. This work proposes a multi-view telepresence system tailored for standalone VR headsets (Meta Quest 3), which, for the first time, fuses geometric data from three synchronized cameras to generate GPU-accelerated point clouds in real time, while integrating wrist-mounted RGB video streams to supply local high-resolution texture detail. The system renders immersive 3D scenes of approximately 75,000 points by combining global 3D structure with fine-grained visual cues. In a controlled user study with 31 participants, the proposed approach significantly outperformed baseline methods, including RGB-only, point-cloud-only, and OpenTeleVision, achieving superior task success rate, completion time, subjective workload, and system usability.
📝 Abstract
Robot teleoperation is critical for applications such as remote maintenance, fleet robotics, search and rescue, and data collection for robot learning. Effective teleoperation requires intuitive 3D visualization with reliable depth cues, which conventional screen-based interfaces often fail to provide. We introduce a multi-view VR telepresence system that (1) fuses geometry from three cameras to produce GPU-accelerated point-cloud rendering on standalone VR hardware, and (2) integrates a wrist-mounted RGB stream to provide high-resolution local detail where point-cloud accuracy is limited. Our pipeline supports real-time rendering of approximately 75k points on the Meta Quest 3. We conducted a within-subject study with 31 participants comparing our system against three baseline visualization modalities: RGB streams alone, a direct stereo-vision projection in the VR headset (OpenTeleVision), and point clouds without additional RGB information. Across three teleoperated manipulation tasks, we measured task success, completion time, perceived workload, and usability. Our system achieved the best overall performance, and the point-cloud modality without RGB also outperformed the RGB streams and OpenTeleVision. These results show that combining global 3D structure with localized high-resolution detail substantially improves telepresence for manipulation and provides a strong foundation for next-generation robot teleoperation systems.
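The geometric fusion step described above can be sketched in a few lines: back-project each calibrated depth camera into a shared world frame, concatenate the per-camera point sets, and subsample to a fixed render budget (~75k points). This is a minimal illustrative sketch, not the paper's implementation; the function names, the pinhole-model back-projection, and the uniform random subsampling strategy are all assumptions made for clarity.

```python
import numpy as np

def depth_to_points(depth, K, T_world_cam):
    """Back-project a depth image (meters) into world-frame 3D points.

    depth: (H, W) depth map; K: 3x3 pinhole intrinsics;
    T_world_cam: 4x4 camera-to-world extrinsics.
    (Illustrative sketch, not the paper's actual pipeline.)
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    valid = z > 0                      # drop invalid (zero-depth) pixels
    u, v, z = u.ravel()[valid], v.ravel()[valid], z[valid]
    x = (u - K[0, 2]) * z / K[0, 0]    # pinhole back-projection
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)  # 4xN homogeneous
    return (T_world_cam @ pts_cam)[:3].T                    # Nx3 world points

def fuse_and_downsample(views, budget=75_000, seed=0):
    """Fuse per-camera point sets and subsample to a fixed render budget."""
    pts = np.concatenate([depth_to_points(d, K, T) for d, K, T in views], axis=0)
    if len(pts) <= budget:
        return pts
    idx = np.random.default_rng(seed).choice(len(pts), budget, replace=False)
    return pts[idx]
```

In a real system the subsampling would likely be GPU-side and depth-aware (e.g. voxel-grid filtering) rather than uniform random, but the fixed point budget is what keeps standalone-headset rendering real time.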