Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential

📅 2025-12-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deep learning models for surgical video segmentation carry high computational costs that block real-time deployment on resource-constrained, non-GPU platforms, while Spiking Neural Networks (SNNs) are limited by sparse annotations and the inherent sparsity of surgical video representations. To address both problems, this paper proposes SpikeSurgSeg, the first spike-driven video Transformer framework for surgical scene segmentation designed for non-GPU hardware. Its core contributions are: (1) the first masked-autoencoding pretraining strategy tailored to SNNs in surgical scenarios; and (2) a lightweight, temporally consistent spike-based segmentation head. Evaluated on EndoVis18 and SurgBleed, SpikeSurgSeg achieves mIoU comparable to state-of-the-art artificial neural networks (ANNs), reduces inference latency by at least 8×, and runs more than 20× faster than mainstream vision foundation models, significantly enhancing real-time intraoperative situational awareness while improving energy efficiency.

📝 Abstract
Modern surgical systems increasingly rely on intelligent scene understanding to provide timely situational awareness for enhanced intra-operative safety. Within this pipeline, surgical scene segmentation plays a central role in accurately perceiving operative events. Although recent deep learning models, particularly large-scale foundation models, achieve remarkable segmentation accuracy, their substantial computational demands and power consumption hinder real-time deployment in resource-constrained surgical environments. To address this limitation, we explore emerging Spiking Neural Networks (SNNs) as a promising paradigm for highly efficient surgical intelligence. However, their performance is still constrained by the scarcity of labeled surgical data and the inherently sparse nature of surgical video representations. To this end, we propose *SpikeSurgSeg*, the first spike-driven video Transformer framework tailored for surgical scene segmentation with real-time potential on non-GPU platforms. To address the limited availability of surgical annotations, we introduce a surgical-scene masked autoencoding pretraining strategy for SNNs that enables robust spatiotemporal representation learning via layer-wise tube masking. Building on this pretrained backbone, we further adopt a lightweight spike-driven segmentation head that produces temporally consistent predictions while preserving the low-latency characteristics of SNNs. Extensive experiments on EndoVis18 and our in-house SurgBleed dataset demonstrate that SpikeSurgSeg achieves mIoU comparable to SOTA ANN-based models while reducing inference latency by at least $8\times$. Notably, it delivers over $20\times$ acceleration relative to most foundation-model baselines, underscoring its potential for time-critical surgical scene segmentation.
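The paper's layer-wise tube-masking variant is not described in detail here, but the underlying idea of tube masking in video masked autoencoding is standard: the same spatial patches are masked in every frame, so the masked regions form tubes along the time axis and the model cannot cheat by copying pixels from adjacent frames. A minimal sketch (the function name and parameters are illustrative, not the paper's API):

```python
import numpy as np

def tube_mask(num_frames, num_patches, mask_ratio, rng=None):
    """Build a boolean tube mask of shape (num_frames, num_patches).

    The SAME spatial patches are masked in every frame, forming
    temporal 'tubes' -- this prevents trivial reconstruction by
    copying a visible patch from a neighboring frame.
    """
    rng = np.random.default_rng(rng)
    num_masked = int(num_patches * mask_ratio)
    # Choose which spatial patch positions to mask (once, for all frames).
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    frame_mask = np.zeros(num_patches, dtype=bool)
    frame_mask[masked_idx] = True
    # Repeat the same spatial mask across the temporal axis.
    return np.tile(frame_mask, (num_frames, 1))

# e.g. 8 frames of a 14x14 patch grid, masking 75% of patches
mask = tube_mask(num_frames=8, num_patches=196, mask_ratio=0.75, rng=0)
```

During pretraining, only the visible (unmasked) patch tokens are fed to the encoder, and the decoder reconstructs the masked tubes.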
Problem

Research questions and friction points this paper is trying to address.

Develops a spike-driven video Transformer for real-time surgical scene segmentation
Addresses computational demands of deep learning in resource-limited surgical settings
Enhances segmentation accuracy and speed on non-GPU platforms with sparse data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spike-driven video Transformer for surgical segmentation
Surgical-scene masked autoencoding pretraining for SNNs
Lightweight spike-driven head for low-latency predictions
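The "spike-driven" computation behind these contributions can be illustrated with a leaky integrate-and-fire (LIF) neuron, the standard SNN building block: activations are binary spikes, so downstream matrix multiplications degenerate into sparse accumulations, which is where the latency and energy savings come from. This is a generic sketch, not the paper's exact neuron model:

```python
import numpy as np

def lif_forward(inputs, v_threshold=1.0, v_reset=0.0, tau=2.0):
    """Run a leaky integrate-and-fire neuron over T timesteps.

    inputs: array of shape (T, N) -- input current per timestep.
    Returns binary spikes of shape (T, N).
    """
    v = np.zeros_like(inputs[0])           # membrane potential
    spikes = []
    for x in inputs:                       # iterate over timesteps
        v = v + (x - (v - v_reset)) / tau  # leaky membrane update
        s = (v >= v_threshold).astype(np.float32)  # fire if above threshold
        v = np.where(s > 0, v_reset, v)    # hard reset after a spike
        spikes.append(s)
    return np.stack(spikes)

# Strong input spikes every step; weak input never crosses threshold.
strong = lif_forward(np.ones((4, 3)) * 2.0)
weak = lif_forward(np.ones((4, 3)) * 0.5)
```

Because the outputs are {0, 1}, a spiking Transformer layer can replace dense multiply-accumulate operations with additions gated by spikes, which is what makes non-GPU, low-power deployment plausible.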
Shihao Zou
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
computer vision
Jingjing Li
University of Alberta, Edmonton, Canada
Wei Ji
School of Medicine, Yale University, New Haven, US
Jincai Huang
Southern University of Science and Technology, jointly with Shenzhen University of Advanced Technology, Shenzhen, China
Kai Wang
Nanfang Hospital, Southern Medical University, Guangzhou, China
Guo Dan
School of Biomedical Engineering, Shenzhen University, Shenzhen, China
Weixin Si
Shenzhen University of Advanced Technology
Mixed Reality · Physically Based Modeling · Medical Data Analysis
Yi Pan
Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, China