Towards Comprehensive Real-Time Scene Understanding in Ophthalmic Surgery through Multimodal Image Fusion

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of real-time, precise instrument–tissue distance perception in vitreoretinal surgery by proposing a multimodal temporal perception network that, for the first time, performs temporal feature fusion between operating microscope (OPMI) images and intraoperative optical coherence tomography (iOCT) scans. The framework extracts OPMI features with a YoloNAS backbone, encodes iOCT data via a CNN encoder, and combines a cross-attention fusion module with a region-based recurrent module to model multimodal temporal dynamics. The resulting system achieves a single-frame inference latency of 22.5 ms while significantly improving multitask performance: it attains 95.79% mAP50 for instrument detection and keypoint localization, and reduces distance estimation error from 284 μm (OPMI only) to 33 μm within 1 mm of the retinal surface.
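
The summary names the architectural pieces but the paper does not publish reference code, so the following is a minimal PyTorch sketch of how the described cross-attention fusion step could look. Every name, shape, and hyperparameter here (`CrossAttentionFusion`, 256-dimensional tokens, 8 heads, the residual + LayerNorm wrapper) is an illustrative assumption, not the authors' implementation: OPMI feature tokens act as queries and iOCT tokens as keys/values, so each microscope location can attend to depth-resolved OCT context.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse OPMI (microscope) features with iOCT features via cross-attention.

    OPMI tokens act as queries; iOCT tokens supply keys and values, so each
    spatial microscope location attends to depth-resolved OCT context.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, opmi_tokens: torch.Tensor, ioct_tokens: torch.Tensor) -> torch.Tensor:
        # opmi_tokens: (B, N_opmi, dim) -- flattened YoloNAS feature map
        # ioct_tokens: (B, N_ioct, dim) -- flattened CNN-encoded iOCT B-scan
        fused, _ = self.attn(query=opmi_tokens, key=ioct_tokens, value=ioct_tokens)
        # Residual connection keeps the OPMI stream intact when iOCT adds little.
        return self.norm(opmi_tokens + fused)

# Example with made-up token counts:
fusion = CrossAttentionFusion()
opmi = torch.randn(1, 400, 256)  # e.g. a 20x20 microscope feature grid
ioct = torch.randn(1, 128, 256)  # e.g. 128 A-scan columns of one B-scan
out = fusion(opmi, ioct)         # (1, 400, 256): OPMI grid enriched with OCT depth cues
```

Using the OPMI stream as the query side is one natural reading of the design, since detection and keypoints live in the microscope view while iOCT contributes depth evidence; the paper may well arrange the streams differently.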

📝 Abstract
Purpose: The integration of multimodal imaging into operating rooms paves the way for comprehensive surgical scene understanding. In ophthalmic surgery, two complementary imaging modalities are available to date: operating microscope (OPMI) imaging and real-time intraoperative optical coherence tomography (iOCT). This first work toward temporal OPMI and iOCT feature fusion demonstrates the potential of multimodal image processing for multi-head prediction through the example of precise instrument tracking in vitreoretinal surgery.

Methods: We propose a multimodal, temporal, real-time-capable network architecture to perform joint instrument detection, keypoint localization, and tool-tissue distance estimation. Our network design integrates a cross-attention fusion module to merge OPMI and iOCT image features, which are efficiently extracted via a YoloNAS and a CNN encoder, respectively. Furthermore, a region-based recurrent module leverages temporal coherence.

Results: Our experiments demonstrate reliable instrument localization and keypoint detection (95.79% mAP50) and show that the incorporation of iOCT significantly improves tool-tissue distance estimation, while achieving real-time processing at 22.5 ms per frame. Especially for close distances to the retina (below 1 mm), the distance estimation accuracy improved from 284 μm (OPMI only) to 33 μm (multimodal).

Conclusion: Feature fusion of multimodal imaging can enhance multi-task prediction accuracy compared to single-modality processing, and real-time performance can be achieved through tailored network design. While our results demonstrate the potential of multimodal processing for image-guided vitreoretinal surgery, they also underline key challenges that motivate future research toward more reliable, consistent, and comprehensive surgical scene understanding.
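
The region-based recurrent module is likewise described only at a high level. The sketch below shows one plausible reading under stated assumptions: instrument-region features are pooled from the fused feature map with `roi_align` and carried across frames by a GRU cell that feeds a distance regression head. The class name, pooling size, hidden width, the choice of `nn.GRUCell`, and the assumption of a fixed number of tracked instruments per frame are all hypothetical.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class RegionRecurrentDistanceHead(nn.Module):
    """Pool fused features inside detected instrument boxes and carry a
    recurrent state across frames before regressing tool-tissue distance."""

    def __init__(self, feat_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.gru = nn.GRUCell(feat_dim, hidden)  # temporal link between frames
        self.dist = nn.Linear(hidden, 1)         # distance estimate (e.g. in mm)

    def forward(self, fused_map, boxes, h=None):
        # fused_map: (B, C, H, W) fused OPMI+iOCT features for one frame
        # boxes: list of per-image (K, 4) instrument boxes from the detector
        roi = roi_align(fused_map, boxes, output_size=(1, 1), spatial_scale=1.0)
        roi = roi.flatten(1)   # (K_total, C): one feature vector per instrument
        h = self.gru(roi, h)   # update recurrent state with this frame's region
        return self.dist(h), h

# Toy rollout over three synthetic frames with one instrument each:
head = RegionRecurrentDistanceHead()
h = None
for _ in range(3):
    feats = torch.randn(1, 256, 20, 20)
    boxes = [torch.tensor([[4.0, 4.0, 12.0, 12.0]])]
    dist, h = head(feats, boxes, h)  # dist: (1, 1) distance for the tracked tool
```

Restricting recurrence to the instrument region, as sketched here, is consistent with the reported 22.5 ms latency: the temporal state stays small regardless of image resolution.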
Problem

Research questions and friction points this paper is trying to address.

real-time scene understanding
ophthalmic surgery
multimodal image fusion
instrument tracking
tool-tissue distance estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal fusion
real-time surgical scene understanding
cross-attention mechanism
instrument tracking
intraoperative OCT