🤖 AI Summary
To address tracking drift in minimally invasive surgical videos—caused by instrument occlusion, deformation, and intra-class appearance variation—this paper proposes an end-to-end Transformer-based multi-instrument real-time tracking framework. The method introduces two key innovations: (1) a surgical-instrument-aware dynamic query mechanism that adaptively activates queries aligned with the current instrument state; and (2) a decoupled spatiotemporal feature alignment module that separately models appearance consistency and motion continuity, thereby enhancing long-term temporal stability and fine-grained category discrimination. The architecture integrates multi-scale visual features, motion-guided attention, and online template updating. Evaluated on the EndoVis 2017/2018 benchmarks, the method achieves 78.6% MOTA and improves IDF1 by 12.3 points over the prior state of the art, while sustaining clinical-grade real-time inference at 32 FPS.
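The dynamic query idea can be illustrated with a minimal sketch. The paper does not publish this code; the function below is a hypothetical simplification that scores each learned query against a pooled feature of the current frame and activates only the top-k best-aligned queries, approximating "activating queries aligned with the current instrument state":

```python
import numpy as np

def activate_queries(queries, frame_feat, k=2):
    """Hypothetical sketch of instrument-aware dynamic query selection.

    queries:    (Q, D) array of learned query embeddings
    frame_feat: (D,)   pooled visual feature of the current frame
    Returns the indices of the k queries most aligned with the frame,
    plus the full score vector.
    """
    # Cosine similarity between each query and the frame feature.
    q_norm = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    f_norm = frame_feat / np.linalg.norm(frame_feat)
    scores = q_norm @ f_norm                     # shape (Q,)
    # Activate only the k best-aligned queries for this frame.
    active = np.argsort(scores)[::-1][:k]
    return active, scores

# Toy example: 4 queries, 8-dim features; the frame closely
# resembles query 1, so query 1 should rank first.
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
frame_feat = queries[1] + 0.01 * rng.normal(size=8)
active, scores = activate_queries(queries, frame_feat, k=2)
```

In the actual framework the scoring and selection would be learned end-to-end inside the Transformer decoder rather than computed with a fixed cosine rule; this sketch only conveys the gating behavior.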