🤖 AI Summary
Existing surgical scene graph methods are limited by binary relation modeling, struggling to capture the complex multimodal group interactions and geometric structures inherent in operating rooms. This work proposes the first unified representation framework based on higher-order topology, modeling surgical scenes as higher-order topological structures rather than collections of pairwise edges. By leveraging a hierarchical higher-order attention mechanism, the approach jointly captures both pairwise and group-wise relationships while natively integrating heterogeneous multimodal data—including 3D geometry, audio, and robotic kinematics—and preserving their underlying manifold structures. Evaluated on critical tasks such as aseptic violation detection, surgical phase prediction, and next-action anticipation, the method significantly outperforms conventional graph-based models and large language model baselines, thereby overcoming the representational limits of scene-graph models in safety-critical medical reasoning.
📝 Abstract
Surgical scene graphs abstract the complexity of the surgical operating room (OR) into a structure of entities and their relations, but existing paradigms are limited to strictly dyadic structures. Frameworks that rely predominantly on pairwise message passing or tokenized sequences flatten the manifold geometry inherent to relational structures, losing information in the process. We introduce TopoOR, a new paradigm that models multimodal operating rooms as higher-order structures, natively preserving both pairwise and group relationships. By lifting interactions between entities into higher-order topological cells, TopoOR directly models the complex dynamics and multimodality present in the OR. This topological representation subsumes traditional scene graphs, thereby offering strictly greater expressivity. We further propose a higher-order attention mechanism that explicitly preserves manifold structure and modality-specific features throughout hierarchical relational attention. Unlike existing methods, this lets us avoid collapsing 3D geometry, audio, and robot kinematics into a single joint latent representation, preserving the precise multimodal structure required for safety-critical reasoning. Extensive experiments demonstrate that our approach outperforms traditional graph-based and LLM-based baselines across sterility breach detection, robot phase prediction, and next-action anticipation.
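To make the "lifting" idea concrete, here is a minimal, hypothetical sketch in plain Python: pairwise scene-graph edges that share an entity are lifted into a single group cell, and each cell is pooled with softmax attention over its members. The entity names, feature values, lifting rule, and query vector are all illustrative assumptions, not the paper's actual construction or architecture.

```python
# Minimal sketch (illustrative only): lifting pairwise scene-graph edges
# into higher-order cells and pooling each cell with softmax attention.
import math

# Toy 4-d features for OR entities (hypothetical values).
features = {
    "surgeon":    [0.9, 0.1, 0.0, 0.2],
    "nurse":      [0.7, 0.3, 0.1, 0.0],
    "instrument": [0.1, 0.8, 0.5, 0.1],
}

# 1-cells: ordinary pairwise relations, as in a standard scene graph.
edges = [("surgeon", "instrument"), ("nurse", "instrument")]

def lift_to_group_cells(edges):
    """Toy lifting rule: two edges that share an entity induce one
    group cell over the union of their endpoints."""
    cells = set()
    for i, (a, b) in enumerate(edges):
        for c, d in edges[i + 1:]:
            if {a, b} & {c, d}:
                cells.add(frozenset({a, b, c, d}))
    return [sorted(c) for c in cells]

def attend_cell(cell, query):
    """Softmax attention over a cell's members, producing one pooled
    feature vector per group interaction."""
    scores = [sum(q * x for q, x in zip(query, features[m])) for m in cell]
    mx = max(scores)
    weights = [math.exp(s - mx) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * features[m][k] for w, m in zip(weights, cell))
            for k in range(len(query))]

cells = lift_to_group_cells(edges)
print(cells)  # [['instrument', 'nurse', 'surgeon']]
pooled = attend_cell(cells[0], query=[1.0, 0.0, 0.0, 0.0])
print([round(v, 3) for v in pooled])
```

The point of the sketch is the representational shift: the two dyadic edges collapse into one ternary cell, so the surgeon–nurse–instrument interaction is attended to as a group rather than reconstructed from pairs.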