🤖 AI Summary
Existing surgical datasets typically treat egocentric and exocentric viewpoints in isolation, hindering comprehensive understanding of surgical activities. To address this, we introduce the first multimodal spinal surgery dataset to integrate egocentric video from wearable glasses with exocentric RGB-D and ultrasound modalities, capturing 94 minutes of emulated spine procedures. It features fine-grained scene graph annotations (568,235 triplets across 36 entity classes and 22 relation types) over 84,553 frames of temporally synchronized multimodal data. We propose a deep egocentric–exocentric coupling acquisition paradigm and a clinically grounded scene graph annotation standard emphasizing surgical interactions, and we establish the first multimodal joint-reasoning benchmark for surgical perception. On scene graph generation, a new baseline that exploits these multimodal, multi-perspective signals achieves significant gains over two adapted state-of-the-art models. This work provides a foundational resource for surgical action recognition and human-robot collaborative perception.
📝 Abstract
Operating rooms (ORs) demand precise coordination among surgeons, nurses, and equipment in a fast-paced, occlusion-heavy environment, necessitating advanced perception models to enhance safety and efficiency. Existing datasets provide either partial egocentric views or sparse exocentric multi-view context, but none comprehensively combines the two. We introduce EgoExOR, the first OR dataset and accompanying benchmark to fuse first-person and third-person perspectives. Spanning 94 minutes (84,553 frames at 15 FPS) of two emulated spine procedures, Ultrasound-Guided Needle Insertion and Minimally Invasive Spine Surgery, EgoExOR integrates egocentric data (RGB, gaze, hand tracking, audio) from wearable glasses, exocentric RGB and depth from RGB-D cameras, and ultrasound imagery. Its detailed scene graph annotations, covering 36 entity classes and 22 relation types (568,235 triplets), enable robust modeling of clinical interactions and support tasks such as action recognition and human-centric perception. We evaluate two adapted state-of-the-art models on surgical scene graph generation and offer a new baseline that explicitly leverages EgoExOR's multimodal and multi-perspective signals. Together, the dataset and benchmark lay a new foundation for OR perception, offering a rich multimodal resource for next-generation clinical perception models.
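For illustration, the sketch below shows one way a temporally synchronized frame and its scene graph triplets could be represented in code. This is a minimal, hypothetical layout: the class name, field names, array shapes, and example triplets are assumptions for exposition, not the released EgoExOR schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

# Hypothetical illustration only: names, shapes, and labels below are
# assumptions, not the actual EgoExOR data format.

@dataclass
class EgoExORFrame:
    timestamp: float                      # seconds on a shared clock across devices
    ego_rgb: np.ndarray                   # wearable-glasses RGB image, (H, W, 3)
    ego_gaze: Tuple[float, float]         # normalized 2D gaze point on the ego image
    ego_hands: np.ndarray                 # hand-tracking keypoints, e.g. (2 hands, 21 joints, 3)
    exo_rgb: List[np.ndarray]             # one RGB image per external RGB-D camera
    exo_depth: List[np.ndarray]           # aligned depth maps, (H, W)
    ultrasound: np.ndarray                # ultrasound image at this timestamp
    # Scene graph triplets (subject, relation, object), drawn from the
    # 36 entity classes and 22 relation types.
    triplets: List[Tuple[str, str, str]] = field(default_factory=list)

# Example annotation for a single synchronized frame.
frame = EgoExORFrame(
    timestamp=1532.4,
    ego_rgb=np.zeros((480, 640, 3), dtype=np.uint8),
    ego_gaze=(0.52, 0.47),
    ego_hands=np.zeros((2, 21, 3), dtype=np.float32),
    exo_rgb=[np.zeros((720, 1280, 3), dtype=np.uint8)],
    exo_depth=[np.zeros((720, 1280), dtype=np.float32)],
    ultrasound=np.zeros((512, 512), dtype=np.uint8),
    triplets=[("surgeon", "holding", "needle"),
              ("surgeon", "looking_at", "ultrasound_machine")],
)
```

A scene graph generation model would consume the synchronized ego/exo streams and predict the `triplets` list for each frame; the design choice of keeping triplets as plain (subject, relation, object) strings mirrors how such annotations are commonly evaluated with recall over predicted triplets.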