🤖 AI Summary
This work addresses the limited 3D spatial awareness of surgical robots, a challenge exacerbated by existing approaches that either suffer from error accumulation in multi-stage pipelines or rely on auxiliary sensors impractical in clinical settings. To overcome this, the authors propose the Spatial Surgical Transformer (SST), which, for the first time, enables end-to-end learning of 3D spatial representations aligned with the robot's action space directly from standard stereo endoscopic images, without requiring additional sensors. The method integrates a geometric transformer, a multi-level spatial feature connector, and visuomotor control formulated in the endoscope coordinate frame. A large-scale, photorealistic surgical 3D dataset, Surgical3D, is also introduced. Evaluated on real robotic platforms performing complex tasks such as knot tying and ex vivo organ dissection, SST achieves state-of-the-art performance, demonstrating strong 3D generalization and significant clinical potential.
📝 Abstract
Achieving 3D spatial awareness is crucial for surgical robotic manipulation, where precise and delicate operations are required. Existing methods either explicitly reconstruct the surgical scene prior to manipulation, or enhance multi-view features by adding wrist-mounted cameras to supplement the default stereo endoscope. However, both paradigms suffer from notable limitations: the former easily leads to error accumulation and prevents end-to-end optimization due to its multi-stage nature, while the latter is rarely adopted in clinical practice since wrist-mounted cameras can interfere with the motion of surgical robot arms. In this work, we introduce the Spatial Surgical Transformer (SST), an end-to-end visuomotor policy that endows surgical robots with 3D spatial awareness by directly exploiting the 3D spatial cues embedded in endoscopic images. First, we build Surgical3D, a large-scale photorealistic dataset containing 30K stereo endoscopic image pairs with accurate 3D geometry, addressing the scarcity of 3D data in surgical scenes. Based on Surgical3D, we fine-tune a powerful geometric transformer to extract robust 3D latent representations from stereo endoscopic images. These representations are then seamlessly aligned with the robot's action space via a lightweight multi-level spatial feature connector (MSFC), all within an endoscope-centric coordinate frame. Extensive real-robot experiments demonstrate that SST achieves state-of-the-art performance and strong spatial generalization on complex surgical tasks such as knot tying and ex vivo organ dissection, representing a significant step toward practical clinical deployment. The dataset and code will be released.
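The abstract describes a three-stage flow: a geometric transformer extracts multi-level 3D latents from the stereo pair, the MSFC fuses them, and the result is projected into an action expressed in the endoscope frame. The minimal NumPy sketch below illustrates only that data flow; all shapes, the pooling strategy, and the 7-DoF action layout are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def geometric_encoder(left_img, right_img, levels=3):
    """Stand-in for the geometric transformer: returns multi-level
    3D latent feature maps for a stereo pair (hypothetical shapes)."""
    h, w = left_img.shape[:2]
    feats = []
    for lvl in range(levels):
        s = 2 ** (lvl + 2)  # coarser spatial resolution at deeper levels
        feats.append(rng.standard_normal((h // s, w // s, 64 * (lvl + 1))))
    return feats

def msfc(features, action_dim=7):
    """Multi-level spatial feature connector (sketch): pool each level,
    concatenate, and linearly project into the robot's action space."""
    pooled = [f.mean(axis=(0, 1)) for f in features]  # global average pool
    z = np.concatenate(pooled)                        # fused 3D latent
    W = rng.standard_normal((action_dim, z.size)) * 0.01
    return W @ z  # delta-pose action in the endoscope-centric frame

left = rng.random((256, 256, 3))
right = rng.random((256, 256, 3))
action = msfc(geometric_encoder(left, right))
print(action.shape)  # (7,) e.g. translation + rotation + gripper
```

Because the action is defined in the endoscope-centric frame, the learned mapping stays valid when the endoscope is repositioned, which is one plausible reason the paper reports strong spatial generalization.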