Hierarchical Context Transformer for Multi-level Semantic Scene Understanding

📅 2025-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the insufficient context-awareness of computer-assisted systems in operating rooms, this paper proposes a unified framework for multi-level semantic scene understanding, spanning four granularities: surgical phase recognition, step recognition, action detection, and instrument detection. Methodologically, the authors introduce the Hierarchical Context Transformer (HCT), a novel architecture incorporating a Hierarchical Relation Aggregation Module (HRAM) to explicitly model inter-granularity dependencies, and an inter-task contrastive learning (ICL) strategy for cross-task representation alignment. Furthermore, they design a lightweight variant, HCT+, which integrates spatial and temporal adapters to jointly optimize accuracy and parameter efficiency. Evaluated on a cataract surgery dataset and the public PSI-AVA dataset, the approach substantially outperforms state-of-the-art methods across all semantic levels. The source code is publicly available.

📝 Abstract
A comprehensive and explicit understanding of surgical scenes plays a vital role in developing context-aware computer-assisted systems in the operating theatre. However, few works provide a systematic analysis to enable hierarchical surgical scene understanding. In this work, we propose to represent the task set [phase recognition → step recognition → action and instrument detection] as multi-level semantic scene understanding (MSSU). For this target, we propose a novel hierarchical context transformer (HCT) network and thoroughly explore the relations across the different-level tasks. Specifically, a hierarchical relation aggregation module (HRAM) is designed to concurrently relate entries inside multi-level interaction information and then augment task-specific features. To further boost the representation learning of the different tasks, inter-task contrastive learning (ICL) is presented to guide the model to learn task-wise features by absorbing complementary information from other tasks. Furthermore, considering the computational cost of the transformer, we propose HCT+, which integrates spatial and temporal adapters to reach competitive performance with substantially fewer tunable parameters. Extensive experiments on our cataract dataset and the publicly available endoscopic PSI-AVA dataset demonstrate the outstanding performance of our method, consistently exceeding state-of-the-art methods by a large margin. The code is available at https://github.com/Aurora-hao/HCT.
Problem

Research questions and friction points this paper is trying to address.

Computer-assisted systems in the operating theatre lack sufficient context-awareness of the surgical scene.
Few existing works systematically model the hierarchy across phase, step, action, and instrument tasks.
Task-specific features are learned in isolation, missing complementary information shared across tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical context transformer network
Inter-task contrastive learning enhancement
Spatial-temporal adapter integration
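The abstract describes inter-task contrastive learning only at a high level; the exact loss is not given here. As a minimal sketch, an InfoNCE-style objective could align features that two task heads (e.g. phase and step) produce for the same frame while pushing apart features of different frames. The function name, temperature value, and formulation below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def inter_task_contrastive_loss(feat_a, feat_b, temperature=0.1):
    """Illustrative InfoNCE-style loss between two task heads.

    feat_a, feat_b: (N, D) feature arrays for the same N frames,
    produced by two different task branches. Row i of feat_a and
    row i of feat_b form the positive pair; all other rows act as
    negatives.
    """
    # L2-normalize so dot products become cosine similarities
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (N, N) similarity matrix
    # log-softmax over each row; positives sit on the diagonal
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Under this formulation, well-aligned cross-task features yield a lower loss than misaligned ones, which is the behavior the paper's ICL strategy aims for.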
Luoying Hao
Research Institute of Trustworthy Autonomous Systems and Dept. of Computer Science and Engineering, Southern University of Science and Technology, China; School of Computer Science, University of Birmingham, UK
Yan Hu
Research Institute of Trustworthy Autonomous Systems and Dept. of Computer Science and Engineering, Southern University of Science and Technology, China
Yang Yue
School of Computer Science, University of Birmingham, UK
Li Wu
Qinghai University
spatiotemporal prediction, uncertainty analysis
Huazhu Fu
Principal Scientist, IHPC, A*STAR
Medical Image Analysis, AI for Healthcare, Medical AI, Trustworthy AI
Jinming Duan
School of Computer Science, University of Birmingham, UK; Division of Informatics, Imaging and Data Sciences, School of Health Sciences, University of Manchester, UK
Jiang Liu
Research Institute of Trustworthy Autonomous Systems and Dept. of Computer Science and Engineering, Southern University of Science and Technology, China