🤖 AI Summary
Existing surgical datasets typically treat egocentric and exocentric viewpoints in isolation, hindering comprehensive understanding of surgical activities. To address this, we introduce the first multimodal spinal surgery dataset to integrate egocentric video from wearable glasses with exocentric RGB-D and ultrasound modalities, capturing 94 minutes of emulated spine procedures. It features fine-grained scene graph annotations (568,235 triplets across 36 entity classes and 22 relation types) over 84,553 frames of temporally synchronized multimodal data. We propose a deep egocentric–exocentric coupling acquisition paradigm and a clinically grounded scene graph annotation standard emphasizing surgical interactions, and we establish the first multimodal joint-reasoning benchmark for surgical perception. On scene graph generation, a new baseline that exploits these multimodal, multi-perspective signals achieves significant gains over two adapted state-of-the-art models. This work provides a foundational resource for surgical action recognition and human-robot collaborative perception.
📝 Abstract
Operating rooms (ORs) demand precise coordination among surgeons, nurses, and equipment in a fast-paced, occlusion-heavy environment, necessitating advanced perception models to enhance safety and efficiency. Existing datasets provide either partial egocentric views or sparse exocentric multi-view context, but none comprehensively combines the two. We introduce EgoExOR, the first OR dataset and accompanying benchmark to fuse first-person and third-person perspectives. Spanning 94 minutes (84,553 frames at 15 FPS) of two emulated spine procedures, Ultrasound-Guided Needle Insertion and Minimally Invasive Spine Surgery, EgoExOR integrates egocentric data (RGB, gaze, hand tracking, audio) from wearable glasses, exocentric RGB and depth from RGB-D cameras, and ultrasound imagery. Its detailed scene graph annotations, covering 36 entity classes and 22 relation types (568,235 triplets), enable robust modeling of clinical interactions and support tasks such as action recognition and human-centric perception. We evaluate two adapted state-of-the-art models on surgical scene graph generation and offer a new baseline that explicitly leverages EgoExOR's multimodal and multi-perspective signals. Together, the dataset and benchmark lay a new foundation for OR perception, offering a rich multimodal resource for next-generation clinical perception models.
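For illustration, the sketch below shows one way a temporally synchronized frame and its scene graph triplets could be represented in code. This is a minimal, hypothetical layout: the class name, field names, array shapes, and example triplets are assumptions for exposition, not the released EgoExOR schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

# Hypothetical illustration only: names, shapes, and labels below are
# assumptions, not the actual EgoExOR data format.

@dataclass
class EgoExORFrame:
    timestamp: float                      # seconds on a shared clock across devices
    ego_rgb: np.ndarray                   # wearable-glasses RGB image, (H, W, 3)
    ego_gaze: Tuple[float, float]         # normalized 2D gaze point on the ego image
    ego_hands: np.ndarray                 # hand-tracking keypoints, e.g. (2 hands, 21 joints, 3)
    exo_rgb: List[np.ndarray]             # one RGB image per external RGB-D camera
    exo_depth: List[np.ndarray]           # aligned depth maps, (H, W)
    ultrasound: np.ndarray                # ultrasound image at this timestamp
    # Scene graph triplets (subject, relation, object), drawn from the
    # 36 entity classes and 22 relation types.
    triplets: List[Tuple[str, str, str]] = field(default_factory=list)

# Example annotation for a single synchronized frame.
frame = EgoExORFrame(
    timestamp=1532.4,
    ego_rgb=np.zeros((480, 640, 3), dtype=np.uint8),
    ego_gaze=(0.52, 0.47),
    ego_hands=np.zeros((2, 21, 3), dtype=np.float32),
    exo_rgb=[np.zeros((720, 1280, 3), dtype=np.uint8)],
    exo_depth=[np.zeros((720, 1280), dtype=np.float32)],
    ultrasound=np.zeros((512, 512), dtype=np.uint8),
    triplets=[("surgeon", "holding", "needle"),
              ("surgeon", "looking_at", "ultrasound_machine")],
)
```

A scene graph generation model would consume the synchronized ego/exo streams and predict the `triplets` list for each frame; the design choice of keeping triplets as plain (subject, relation, object) strings mirrors how such annotations are commonly evaluated with recall over predicted triplets.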