🤖 AI Summary
This study investigates the trade-off between frame rate and segmentation performance in real-time zero-shot surgical video segmentation. We identify a counterintuitive phenomenon: SAM2 achieves higher offline mIoU at a low frame rate (1 FPS) than at a high frame rate (25 FPS), which reveals a misalignment between conventional offline evaluation and clinical deployment requirements. To address this, we propose a redefined real-time evaluation paradigm grounded jointly in *streaming inference latency*, *temporal mask consistency*, and *surgeon clinical preferences*. Methodologically, we develop a zero-shot transfer framework built on SAM2 that incorporates a multi-granularity frame sampling strategy and introduces quantitative metrics for streaming latency and mask jitter. On a cholecystectomy dataset: (1) offline mIoU is 3.2% higher at 1 FPS than at 25 FPS; (2) under streaming conditions, temporal jitter is reduced by 67% at 25 FPS, and 92% of expert surgeons prefer the 25 FPS overlay visualization. This work is the first to expose and characterize the frame-rate–performance evaluation mismatch, and it steers the design of clinically viable real-time segmentation systems toward optimizing both real-time streaming fidelity and human-centered usability.
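The summary mentions a mask jitter metric without defining it. As a minimal sketch only, one plausible reading is the mean change (1 - IoU) between consecutive predicted masks; the function names `mask_iou` and `mask_jitter` and this exact definition are assumptions for illustration, not the paper's metric.

```python
from typing import List, Sequence

# Assumed representation: a 2D binary mask as nested sequences, 1 = object pixel.
Mask = Sequence[Sequence[int]]


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    # Two empty masks agree perfectly, so treat that case as IoU = 1.
    return inter / union if union else 1.0


def mask_jitter(masks: List[Mask]) -> float:
    """Hypothetical jitter score: mean (1 - IoU) between consecutive masks.

    0.0 means the mask never changes; 1.0 means consecutive masks never overlap.
    """
    if len(masks) < 2:
        return 0.0
    changes = [1.0 - mask_iou(prev, cur) for prev, cur in zip(masks, masks[1:])]
    return sum(changes) / len(changes)
```

A score like this cannot separate genuine object motion from prediction flicker, so it is only meaningful when compared across frame rates on the same video.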
📝 Abstract
Real-time video segmentation is a promising feature for AI-assisted surgery, providing intraoperative guidance by identifying surgical tools and anatomical structures. However, deploying state-of-the-art segmentation models, such as SAM2, in real-time settings is computationally demanding, which makes it essential to balance frame rate and segmentation performance. In this study, we investigate the impact of frame rate on zero-shot surgical video segmentation, evaluating SAM2's effectiveness across multiple frame sampling rates for cholecystectomy procedures. Surprisingly, our findings indicate that in conventional evaluation settings, frame rates as low as a single frame per second can outperform 25 FPS, as fewer frames smooth out segmentation inconsistencies. However, when assessed in a real-time streaming scenario, higher frame rates yield superior temporal coherence and stability, particularly for dynamic objects such as surgical graspers. Finally, we investigate human perception of real-time surgical video segmentation among professionals who work closely with such data and find that respondents consistently prefer high FPS segmentation mask overlays, reinforcing the importance of real-time evaluation in AI-assisted surgery.
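To make the contrast between offline and streaming evaluation concrete, the sketch below subsamples a 25 FPS video to a lower segmentation rate and measures how stale the displayed mask becomes when the most recent mask is held until the next one arrives. The hold-last-mask display model and the helper names (`sampled_indices`, `held_mask_indices`, `mean_staleness_seconds`) are assumptions for illustration; the abstract does not specify the streaming protocol.

```python
from typing import List


def _step(source_fps: int, target_fps: int) -> int:
    """Frames between segmented frames; assumes target_fps divides source_fps."""
    if target_fps <= 0 or source_fps % target_fps != 0:
        raise ValueError("target_fps must be a positive divisor of source_fps")
    return source_fps // target_fps


def sampled_indices(num_frames: int, source_fps: int = 25, target_fps: int = 1) -> List[int]:
    """Indices of the frames that are actually segmented at target_fps."""
    return list(range(0, num_frames, _step(source_fps, target_fps)))


def held_mask_indices(num_frames: int, source_fps: int = 25, target_fps: int = 1) -> List[int]:
    """For each displayed frame, the index of the segmented frame whose mask is shown.

    Assumed display model: the latest mask is held until a new one is available.
    """
    step = _step(source_fps, target_fps)
    return [(i // step) * step for i in range(num_frames)]


def mean_staleness_seconds(num_frames: int, source_fps: int = 25, target_fps: int = 1) -> float:
    """Average age, in seconds, of the mask overlaid on each displayed frame."""
    held = held_mask_indices(num_frames, source_fps, target_fps)
    total_frames_old = sum(i - h for i, h in enumerate(held))
    return total_frames_old / num_frames / source_fps
```

Under these assumptions, `mean_staleness_seconds(50, 25, 1)` is about 0.48 s while `mean_staleness_seconds(50, 25, 25)` is 0.0. Offline scoring only at the sampled frames never sees this lag, whereas a streaming viewer does, which is one way the two evaluation settings can diverge for fast-moving objects such as graspers.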