🤖 AI Summary
This study addresses fine-grained action analysis of renorrhaphy (renal suturing) in robot-assisted partial nephrectomy (RAPN), where visually similar gestures, variable action duration, and severe class imbalance make frame-level recognition difficult. The authors introduce SIA-RAPN, presented as the first frame-level fine-grained action segmentation benchmark for this procedure, built from 50 clinical da Vinci Xi surgical videos annotated with 12 frame-level labels. Visual features are extracted with I3D, and four temporal models (MS-TCN++, AsFormer, TUT, and DiffAct) are evaluated on them. Taking the strongest value each model reports across the five released split configurations, DiffAct scores highest on segmental F1, frame-wise accuracy, edit score, and frame-wise mAP, while MS-TCN++ scores highest on balanced accuracy. The benchmark also reports cross-domain results on a separate single-port RAPN dataset.
📝 Abstract
Fine-grained action segmentation during renorrhaphy in robot-assisted partial nephrectomy (RAPN) requires frame-level recognition of visually similar suturing gestures with variable duration and substantial class imbalance. The SIA-RAPN benchmark defines this problem on 50 clinical videos acquired with the da Vinci Xi system and annotated with 12 frame-level labels. The benchmark compares four temporal models built on I3D features: MS-TCN++, AsFormer, TUT, and DiffAct. Evaluation uses balanced accuracy, edit score, segmental F1 at overlap thresholds of 10%, 25%, and 50%, frame-wise accuracy, and frame-wise mean average precision (mAP). In addition to the primary evaluation across five released split configurations on SIA-RAPN, the benchmark reports cross-domain results on a separate single-port RAPN dataset. Comparing the strongest value each model reports across those five split configurations on the primary dataset, DiffAct achieves the highest segmental F1, frame-wise accuracy, edit score, and frame-wise mAP, while MS-TCN++ attains the highest balanced accuracy.
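To make the evaluation metrics concrete, the following is a minimal Python sketch of four of them on a single video. It assumes the definitions conventionally used in the action segmentation literature (segment-level Levenshtein distance for the edit score, IoU matching for segmental F1, mean per-class recall for balanced accuracy); the abstract does not confirm that SIA-RAPN uses exactly these definitions, and all function names are hypothetical. Frame-wise mAP is omitted because it requires per-class confidence scores rather than hard labels.

```python
from collections import defaultdict


def segments(labels):
    """Collapse frame labels into (label, start, end) runs; end is exclusive."""
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((labels[start], start, i))
            start = i
    return runs


def frame_accuracy(pred, gt):
    """Fraction of frames whose predicted label equals the ground truth."""
    return sum(p == g for p, g in zip(pred, gt)) / len(gt)


def balanced_accuracy(pred, gt):
    """Mean per-class recall over the classes present in the ground truth."""
    hits, totals = defaultdict(int), defaultdict(int)
    for p, g in zip(pred, gt):
        totals[g] += 1
        hits[g] += int(p == g)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def edit_score(pred, gt):
    """100 * (1 - normalized Levenshtein distance) between segment label orders."""
    p = [s[0] for s in segments(pred)]
    g = [s[0] for s in segments(gt)]
    prev = list(range(len(g) + 1))
    for i in range(1, len(p) + 1):
        cur = [i] + [0] * len(g)
        for j in range(1, len(g) + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return (1 - prev[-1] / max(len(p), len(g))) * 100


def f1_at_k(pred, gt, overlap):
    """Segmental F1 (0-100): a predicted segment is a true positive if its IoU
    with a not-yet-matched ground-truth segment of the same label >= overlap."""
    p_segs, g_segs = segments(pred), segments(gt)
    matched = [False] * len(g_segs)
    tp = fp = 0
    for label, ps, pe in p_segs:
        best_iou, best_j = 0.0, -1
        for j, (gl, gs, ge) in enumerate(g_segs):
            inter = min(pe, ge) - max(ps, gs)
            if gl != label or inter <= 0:
                continue
            iou = inter / (max(pe, ge) - min(ps, gs))
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= overlap and not matched[best_j]:
            tp += 1
            matched[best_j] = True
        else:
            fp += 1
    fn = matched.count(False)
    denom = 2 * tp + fp + fn
    return 100.0 * 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Toy sequences with three hypothetical gesture classes (0, 1, 2).
    gt = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
    pred = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2]
    print("frame accuracy:", frame_accuracy(pred, gt))
    print("balanced accuracy:", balanced_accuracy(pred, gt))
    print("edit score:", edit_score(pred, gt))
    for k in (0.10, 0.25, 0.50):
        print(f"F1@{int(k * 100)}:", f1_at_k(pred, gt, k))
```

On this toy pair, a single boundary error lowers frame-wise accuracy but leaves the edit score and F1@50 at their maximum, because the segment order is unchanged and every predicted segment still overlaps its ground-truth counterpart by more than half. This is why segment-level metrics are usually reported alongside frame-level ones: they penalize over-segmentation rather than small boundary shifts.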