ORQA: A Benchmark and Foundation Model for Holistic Operating Room Modeling

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current operating room (OR) computing systems are predominantly single-task oriented, with limited generalizability and little cross-task adaptability. To address this, the authors propose ORQA, the first multimodal question-answering benchmark and foundation model for holistic OR scene understanding. ORQA unifies the four major public OR datasets into one benchmark, enabling joint modeling across diverse tasks such as surgical phase recognition and scene graph generation. Methodologically, the work adopts a multimodal large language model that fuses visual, auditory, and structured clinical data, and introduces a progressive knowledge distillation paradigm to build a scalable model family ranging from lightweight to full-parameter variants. Evaluated on the ORQA benchmark, the approach shows strong performance and zero-shot generalization, establishing a paradigm for extensible, unified intelligent modeling of operating rooms.
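The code has not been released, so the following is only a minimal PyTorch-style sketch of the kind of fusion the summary describes: per-modality encoders project visual, auditory, and structured signals into the language model's token-embedding space, where they are concatenated with the text prompt. All module names, dimensions, and token counts here are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Hypothetical sketch: project OR signals into a shared LLM token space."""

    def __init__(self, vis_dim=1024, aud_dim=512, struct_dim=64, llm_dim=4096):
        super().__init__()
        # One linear projector per modality, mapping encoder features
        # into the language model's embedding dimension.
        self.vis_proj = nn.Linear(vis_dim, llm_dim)
        self.aud_proj = nn.Linear(aud_dim, llm_dim)
        self.struct_proj = nn.Linear(struct_dim, llm_dim)

    def forward(self, vis_feats, aud_feats, struct_feats, text_embeds):
        # Each input: (batch, num_tokens, modality_dim); output tokens all share llm_dim.
        fused = torch.cat(
            [
                self.vis_proj(vis_feats),
                self.aud_proj(aud_feats),
                self.struct_proj(struct_feats),
                text_embeds,  # text is already in the LLM embedding space
            ],
            dim=1,  # concatenate along the sequence axis
        )
        return fused  # fed to the LLM as a multimodal prefix

# Illustrative shapes: 16 visual, 8 audio, 4 structured, 32 text tokens.
fusion = MultimodalFusion()
out = fusion(
    torch.randn(1, 16, 1024),
    torch.randn(1, 8, 512),
    torch.randn(1, 4, 64),
    torch.randn(1, 32, 4096),
)
print(out.shape)  # torch.Size([1, 60, 4096])
```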

📝 Abstract
The real-world complexity of surgeries requires surgeons to have a deep and holistic comprehension to ensure precision, safety, and effective interventions. Computational systems are required to have a similar level of comprehension within the operating room. Prior works, limited to single-task efforts like phase recognition or scene graph generation, lack scope and generalizability. In this work, we introduce ORQA, a novel OR question answering benchmark and foundational multimodal model to advance OR intelligence. By unifying all four public OR datasets into a comprehensive benchmark, we enable our approach to concurrently address a diverse range of OR challenges. The proposed multimodal large language model fuses diverse OR signals such as visual, auditory, and structured data for a holistic modeling of the OR. Finally, we propose a novel, progressive knowledge distillation paradigm to generate a family of models optimized for different speed and memory requirements. We show the strong performance of ORQA on our proposed benchmark, and its zero-shot generalization, paving the way for scalable, unified OR modeling and significantly advancing multimodal surgical intelligence. We will release our code and data upon acceptance.
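Since the benchmark casts heterogeneous tasks as question answering, individual samples presumably share one instruction-style schema. The snippet below is a purely illustrative guess at such a format; the field names, file paths, and question phrasings are invented for this sketch and are not taken from the paper.

```python
# Hypothetical unified QA samples: different OR tasks expressed in one schema.
samples = [
    {
        "modalities": {"video": "take_017/frame_0421.jpg", "audio": "take_017/clip_0421.wav"},
        "question": "Which surgical phase is currently being performed?",
        "answer": "anastomosis",  # phase recognition recast as QA
    },
    {
        "modalities": {"video": "take_017/frame_0421.jpg"},
        "question": "List the interactions between staff and equipment in the scene.",
        "answer": "circulator adjusts operating lamp; assistant holds retractor",  # scene graph generation recast as QA
    },
]

def to_prompt(sample):
    """Render a sample as an instruction-style prompt for the model."""
    return f"Question: {sample['question']}\nAnswer:"

for s in samples:
    print(to_prompt(s), s["answer"])
```

Framing every task this way is what lets a single model train on all four datasets jointly and answer unseen task phrasings zero-shot.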
Problem

Research questions and friction points this paper is trying to address.

Develop a holistic OR benchmark for diverse surgical challenges
Create multimodal model fusing visual, auditory, and structured data
Enable scalable OR modeling via progressive knowledge distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies four public OR datasets into benchmark
Multimodal model fuses visual, auditory, structured data
Progressive distillation yields models optimized for different speed and memory budgets (see the sketch below)
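The paper's exact distillation objective is not described here, so this is only a generic sketch of response-based knowledge distillation (softened-logit KL divergence, in the style of Hinton et al.), which a "progressive" scheme would typically apply stage by stage from larger to successively smaller students. Function names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic response-based KD loss: KL between temperature-softened distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t**2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t ** 2)

# A "progressive" model family could be built by chaining stages:
# full model -> mid-sized student -> lightweight student,
# where each stage's trained student becomes the next stage's teacher.
teacher_logits = torch.randn(4, 32000)  # (batch, vocab); shapes are illustrative
student_logits = torch.randn(4, 32000)
loss = distillation_loss(student_logits, teacher_logits)
print(loss.item())
```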
👥 Authors
Ege Ozsoy
Computer Aided Medical Procedures, Technische Universität München, Germany; MCML, Germany
Chantal Pellegrini
Technical University of Munich
David Bani-Harouni
Technical University of Munich
Kun Yuan
Computer Aided Medical Procedures, Technische Universität München, Germany; MCML, Germany
Matthias Keicher
Technische Universität München
Nassir Navab
Professor of Computer Science, Technische Universität München