AI Summary
Surgical robots lack robust spatial perception, leading to collisions and workflow interruptions, particularly in occluded or cluttered environments and in distributed multi-arm systems. To address this, we propose a markerless, embodiment-aware perception method that replaces conventional infrared marker-based tracking with a lightweight stereo RGB camera and a novel Transformer architecture, enabling high-precision pose estimation and scene understanding even under complete intraoperative occlusion. Trained on 1.4 million multi-centre self-annotated surgical images, our approach achieves the first end-to-end, full-scene robotic tracking under a single sterile drape, eliminating the need for external markers or calibration. It improves field-of-view coverage by 25% and significantly reduces deployment complexity. Experimental validation confirms clinical feasibility in dynamic scenarios, including live-tissue respiratory motion compensation. This work establishes a foundational perception capability for modular, autonomous intelligent surgery.
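To make the described pipeline concrete, here is a minimal sketch of the kind of transformer-based pose estimator outlined above: a stereo RGB pair goes in, and estimates of the robot's joint configuration and 6-DoF base pose come out. Everything in it (the `StereoPoseTransformer` name, the ViT-style patch encoder, the token and head dimensions) is an illustrative assumption rather than the authors' actual architecture.

```python
# Hypothetical sketch of a transformer-based pose estimator operating on a
# stereo RGB pair. Names, dimensions, and output heads are assumptions for
# illustration, not the architecture from the paper.
import torch
import torch.nn as nn

class StereoPoseTransformer(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, depth=6, heads=8,
                 n_joints=7):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # Shared patch embedding applied to both stereo views.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Learned embeddings distinguish left-view from right-view tokens.
        self.view_embed = nn.Parameter(torch.zeros(2, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Regression heads: joint angles plus a 6-DoF base pose
        # (translation + axis-angle rotation).
        self.joint_head = nn.Linear(dim, n_joints)
        self.base_head = nn.Linear(dim, 6)

    def tokens(self, img, view):
        t = self.patch_embed(img).flatten(2).transpose(1, 2)  # (B, N, dim)
        return t + self.pos_embed + self.view_embed[view]

    def forward(self, left, right):
        b = left.shape[0]
        x = torch.cat([self.cls.expand(b, -1, -1),
                       self.tokens(left, 0), self.tokens(right, 1)], dim=1)
        feat = self.encoder(x)[:, 0]  # CLS token summarises the full scene
        return self.joint_head(feat), self.base_head(feat)

model = StereoPoseTransformer()
left = torch.randn(2, 3, 224, 224)   # batch of left-camera frames
right = torch.randn(2, 3, 224, 224)  # matching right-camera frames
joints, base = model(left, right)
print(joints.shape, base.shape)      # torch.Size([2, 7]) torch.Size([2, 6])
```

Tokenising both views into a single attention sequence lets the model fuse left/right evidence over the whole scene rather than matching individual markers, which is one plausible route to the occlusion robustness claimed above.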
Abstract
Despite their mechanical sophistication, surgical robots remain blind to their surroundings. This lack of spatial awareness causes collisions, system recoveries, and workflow disruptions, issues that will intensify with the introduction of distributed robots with independent, interacting arms. Existing tracking systems rely on bulky infrared cameras and reflective markers, providing only limited views of the surgical scene and adding hardware burden in crowded operating rooms. We present a marker-free proprioception method that enables precise localisation of surgical robots under their sterile draping, despite the associated obstruction of visual cues. Our method relies solely on lightweight stereo RGB cameras and novel transformer-based deep learning models. It builds on the largest multi-centre spatial robotic surgery dataset to date (1.4M self-annotated images from human cadaveric and preclinical in vivo studies). By tracking the entire robot and surgical scene, rather than individual markers, our approach provides a holistic view that is robust to occlusions, supporting surgical scene understanding and context-aware control. We demonstrate an example of potential clinical benefit during in vivo breathing compensation, with access to tissue dynamics that are unobservable under state-of-the-art tracking, and show accurate localisation in multi-robot systems for future intelligent interaction. In addition, compared with existing systems, our method eliminates markers and improves tracking visibility by 25%. To our knowledge, this is the first demonstration of marker-free proprioception for fully draped surgical robots, reducing setup complexity, enhancing safety, and paving the way toward modular and autonomous robotic surgery.
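As a rough illustration of how access to tissue dynamics could support the in vivo breathing compensation mentioned above, the sketch below fits a simple sinusoidal respiration model to a tracked tissue-displacement signal and predicts a short-horizon feed-forward offset. The model, the 30 Hz track, and the 0.1 s horizon are all assumptions for demonstration; the paper's actual compensation scheme is not reproduced here.

```python
# Illustrative respiratory motion compensation from a tracked tissue signal.
# The sinusoidal model and all parameters are demonstration assumptions,
# not the paper's controller.
import numpy as np
from scipy.optimize import curve_fit

def breathing_model(t, amp, freq, phase, offset):
    """Simple sinusoidal model of respiratory tissue displacement (mm)."""
    return amp * np.sin(2 * np.pi * freq * t + phase) + offset

# Simulated tissue z-displacement track, standing in for the output of a
# markerless full-scene perception system (30 Hz over 10 s).
t = np.linspace(0, 10, 300)
obs = breathing_model(t, 8.0, 0.25, 0.3, 2.0)   # 15 breaths per minute
obs += np.random.normal(0, 0.4, t.shape)        # tracking noise

# Fit the respiration model to the observed displacement.
p0 = [5.0, 0.3, 0.0, 0.0]                       # initial parameter guess
params, _ = curve_fit(breathing_model, t, obs, p0=p0)

# Feed-forward compensation: predict displacement a short horizon ahead
# (to cover controller latency) and offset the instrument setpoint by it.
horizon = 0.1                                   # seconds
predicted = breathing_model(t[-1] + horizon, *params)
print(f"predicted tissue offset in {horizon:.2f}s: {predicted:.2f} mm")
```

In a real system such a prediction would typically be refreshed every frame and blended with the surgeon's commanded motion rather than applied as a one-shot offset.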