🤖 AI Summary
Existing surgical video tracking methods struggle with clinical dynamics (e.g., instruments exiting the field of view or the body cavity) and visual disturbances (e.g., smoke, specular reflections, blood), resulting in poor clinical adaptability. To address this, we introduce the first multi-class, multi-instrument tracking benchmark specifically designed for laparoscopic surgery, comprising 20 procedures, over 35,000 frames, and more than 65,000 fine-grained annotations. Departing from generic tracking paradigms, we propose a surgery-aware triple-trajectory definition: intraoperative, intracorporeal, and visibility within the camera's field of view. Annotations are meticulously curated across six dimensions: spatial location, class, unique ID, operating surgeon, procedural phase, and visual condition, enabling unified support for multi-object tracking (MOT), instance segmentation, and spatiotemporal relationship modeling. This benchmark is the finest-grained laparoscopic instrument tracking dataset to date, substantially enhancing the fidelity of surgical behavior modeling. It has already enabled AI-driven clinical applications including surgical skill assessment, safety-zone prediction, and human-robot collaboration, and has been adopted by multiple international surgical AI research groups.
📝 Abstract
Tool tracking in surgical videos is vital in computer-assisted intervention for tasks such as surgeon skill assessment, safety zone estimation, and human-machine collaboration during minimally invasive procedures. The lack of large-scale datasets hampers the adoption of artificial intelligence in this domain. Current datasets exhibit an overly generic tracking formalization that often lacks surgical context, a deficiency that becomes evident when tools move out of the camera's scope, resulting in rigid trajectories that hinder realistic surgical representation. This paper addresses the need for a more precise and adaptable tracking formalization tailored to the intricacies of endoscopic procedures by introducing CholecTrack20, an extensive dataset meticulously annotated for multi-class multi-tool tracking across three perspectives, each representing a different way of delimiting the temporal duration of a tool trajectory: (1) intraoperative, (2) intracorporeal, and (3) visibility within the camera's scope. The dataset comprises 20 laparoscopic videos with over 35,000 frames and 65,000 annotated tool instances, each detailed with spatial location, category, identity, operator, phase, and surgical visual conditions. This level of detail caters to the evolving assistive requirements within a procedure.
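To make the triple-trajectory formalization concrete, the sketch below models a single annotation record carrying the six label dimensions and one track identity per perspective. All field names and values here are illustrative assumptions, not the dataset's actual schema: the point is only that a tool leaving and re-entering the camera's scope starts a new visibility track while its intraoperative and intracorporeal identities persist.

```python
from dataclasses import dataclass

# Hypothetical annotation record (field names are assumptions for
# illustration, not CholecTrack20's real file format).
@dataclass
class ToolAnnotation:
    frame_id: int
    bbox: tuple          # (x, y, w, h) spatial location
    category: str        # tool class, e.g. "grasper"
    intraop_id: int      # identity over the entire operation
    intracorp_id: int    # identity while the tool stays inside the body
    visibility_id: int   # identity while the tool stays in the camera's scope
    operator: str        # surgeon handling the tool
    phase: str           # surgical phase
    condition: str       # visual condition, e.g. "clear", "smoke"

# Two sightings of the same physical grasper: it briefly left the field
# of view between frames, so the visibility-track ID advances while the
# intraoperative and intracorporeal IDs are unchanged.
anns = [
    ToolAnnotation(10, (120, 80, 60, 40), "grasper", 1, 1, 1,
                   "main-surgeon", "dissection", "clear"),
    ToolAnnotation(95, (200, 150, 58, 42), "grasper", 1, 1, 2,
                   "main-surgeon", "dissection", "smoke"),
]

# Counting distinct tracks under each perspective: one intraoperative
# trajectory versus two visibility trajectories for the same tool.
intraop_tracks = {a.intraop_id for a in anns}
visibility_tracks = {a.visibility_id for a in anns}
print(len(intraop_tracks), len(visibility_tracks))  # 1 2
```

The same records thus support evaluation under any of the three trajectory definitions simply by grouping on a different identity field.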