AI Summary
This work addresses the limited contextual awareness and autonomous decision-making of current robotic systems in total knee arthroplasty, which hinder precise bone resection. The authors propose ArthroCut, a framework that integrates preoperative imaging with intraoperative multimodal data, including CT/MR scans, NDI tracking, RGB-D video, robot state, and textual surgical intent. By introducing Preoperative Imaging Tokens (PIT) and Time-Aligned Surgical Tokens (TAST), and by applying grammar- and safety-constrained decoding, ArthroCut enables interpretable and reliable autonomous planning and execution of the six standard bone cuts. Built on a Qwen-VL backbone for multimodal tokenization, the method achieves an average success rate of 86% across seven benchtop trials, significantly outperforming baseline approaches and demonstrating the efficacy of the proposed multimodal alignment and constrained action generation mechanism.
Abstract
Despite the rapid commercialization of surgical robots, their autonomy and real-time decision-making remain limited in practice. To address this gap, we propose ArthroCut, an autonomous policy-learning framework that upgrades knee arthroplasty robots from assistive execution to context-aware action generation. ArthroCut fine-tunes a Qwen-VL backbone on a self-built, time-synchronized multimodal dataset of 21 complete cases (23,205 RGB-D pairs), integrating preoperative CT/MR, intraoperative NDI tracking of bones and the end effector, RGB-D surgical video, robot state, and textual intent. The method operates on two complementary token families: Preoperative Imaging Tokens (PIT), which encode patient-specific anatomy and planned resection planes, and Time-Aligned Surgical Tokens (TAST), which fuse real-time visual, geometric, and kinematic evidence. From these, it emits an interpretable action grammar under grammar- and safety-constrained decoding. In bench-top experiments on a knee prosthesis across seven trials, ArthroCut achieves an average success rate of 86% over the six standard resections, significantly outperforming strong baselines trained under the same protocol. Ablations show that TAST is the principal driver of reliability while PIT provides essential anatomical grounding; their combination yields the most stable multi-plane execution. These results indicate that aligning preoperative geometry with time-aligned intraoperative perception, and translating that alignment into tokenized, constrained actions, is an effective path toward robust, interpretable autonomy in orthopedic robotic surgery.
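To make the grammar- and safety-constrained decoding idea concrete, here is a minimal sketch of the general technique, not the authors' implementation: at each decoding step, candidate tokens are first masked to those the action grammar permits from the current state, a safety predicate vetoes disallowed values, and the highest-scoring surviving token is emitted. The grammar, token vocabulary, scores, and the 12 mm depth limit below are all hypothetical stand-ins.

```python
# Toy action grammar: allowed successor tokens for each token (hypothetical).
GRAMMAR = {
    "<start>": {"CUT"},
    "CUT": {"distal_femur", "posterior_femur", "tibia_plateau"},
    "distal_femur": {"DEPTH"},
    "posterior_femur": {"DEPTH"},
    "tibia_plateau": {"DEPTH"},
    "DEPTH": {"8mm", "9mm", "25mm"},
    "8mm": {"<end>"},
    "9mm": {"<end>"},
    "25mm": {"<end>"},
}

def safe(token):
    # Hypothetical safety constraint: reject resection depths over 12 mm.
    if token.endswith("mm"):
        return int(token[:-2]) <= 12
    return True

def decode(scores):
    """Greedy decoding under grammar and safety masks.

    `scores` maps tokens to model scores (a stand-in for LLM logits);
    unknown tokens default to 0.0.
    """
    out, state = [], "<start>"
    while state != "<end>":
        # Mask: only grammar-legal, safety-approved successors survive.
        allowed = [t for t in GRAMMAR.get(state, {"<end>"}) if safe(t)]
        state = max(allowed, key=lambda t: scores.get(t, 0.0))
        if state != "<end>":
            out.append(state)
    return out

# Even if the raw scores prefer an unsafe 25 mm depth, the mask excludes it.
scores = {"CUT": 1.0, "tibia_plateau": 0.9, "DEPTH": 1.0,
          "25mm": 0.95, "9mm": 0.4, "8mm": 0.3}
print(decode(scores))  # ['CUT', 'tibia_plateau', 'DEPTH', '9mm']
```

The point of this design is that safety is enforced structurally at decode time rather than hoped for from the learned policy: an out-of-grammar or unsafe action can never be emitted, regardless of the model's scores.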