🤖 AI Summary
This work addresses the challenge of navigating endoluminal robots through complex, narrow, and tortuous anatomical pathways, where existing visual localization methods suffer from limited accuracy due to tissue deformation, in vivo artifacts, and a lack of distinctive visual features. To overcome these limitations, the authors propose EndoSERV, a novel approach that combines segment-wise structure modeling with real-to-synthetic domain transfer learning, requiring no ground-truth pose labels for real data. The method partitions long endoluminal trajectories into shorter sub-segments and estimates visual odometry for each independently, while offline pretraining extracts texture-invariant features. During inference, it adaptively maps real-image features into a synthetic domain where ground-truth poses are available, so the pose estimator can be optimized with synthetic supervision. Experiments on both public and clinical datasets demonstrate that EndoSERV achieves accurate, robust navigation even without any real-world pose annotations.
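The core transfer idea, reduced to a minimal PyTorch sketch: a mapping network pulls real-image features into the synthetic feature space, where a pose head can be supervised with synthetic ground-truth poses. All module names, the mapper architecture, and the simple moment-matching alignment loss below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of real-to-synthetic feature transfer for pose
# supervision. Names and losses are illustrative, not from the paper.
import torch
import torch.nn as nn

class FeatureMapper(nn.Module):
    """Maps real-image features into the synthetic feature space."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, f_real: torch.Tensor) -> torch.Tensor:
        return self.net(f_real)

def adaptation_step(encoder, mapper, pose_head,
                    real_imgs, syn_imgs, syn_poses, opt):
    """One adaptation step: align mapped real features with synthetic
    features so a pose head trained on synthetic labels transfers."""
    with torch.no_grad():
        f_syn = encoder(syn_imgs)        # synthetic features (pose-labeled)
    f_real = mapper(encoder(real_imgs))  # real features mapped to syn domain

    # Pose supervision is only available in the synthetic domain.
    pose_loss = nn.functional.mse_loss(pose_head(f_syn), syn_poses)
    # Crude distribution alignment via first-moment matching (a stand-in
    # for whatever adaptive mapping objective the method actually uses).
    align_loss = nn.functional.mse_loss(f_real.mean(0), f_syn.mean(0))

    loss = pose_loss + 0.1 * align_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```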
📝 Abstract
Robot-assisted endoluminal procedures are increasingly used for early cancer intervention. However, the intricate, narrow, and tortuous pathways within the luminal anatomy pose substantial difficulties for robot navigation. Vision-based navigation offers a promising solution, but existing localization approaches are error-prone due to tissue deformation, in vivo artifacts, and a lack of distinctive landmarks for consistent localization. This paper presents EndoSERV, a novel localization method that addresses these challenges. It comprises two main parts, i.e., SEgment-to-structure and Real-to-Virtual mapping, hence the name. We divide long-range, complex luminal structures into smaller sub-segments and estimate the odometry of each independently. To compensate for the lack of real pose labels, an efficient transfer technique maps real image features into the virtual domain, where virtual pose ground truth is available for supervision. Training proceeds in two phases: offline pretraining to extract texture-agnostic features, followed by an online phase that adapts to real-world conditions. Extensive experiments on both public and clinical datasets demonstrate the effectiveness of the method even without any real pose labels.
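As a concrete illustration of the segment-to-structure idea, here is a minimal sketch of chaining independently estimated per-segment relative poses into one absolute trajectory. It assumes each relative pose is a 4x4 homogeneous SE(3) matrix; the function name and data layout are hypothetical, not from the paper.

```python
# Hypothetical sketch: compose per-segment relative poses into a global
# trajectory, assuming 4x4 homogeneous SE(3) transforms per frame pair.
import numpy as np

def compose_trajectory(segment_relative_poses):
    """Chain sub-segment odometry into absolute poses along the lumen.

    segment_relative_poses: list of lists; each inner list holds the 4x4
    relative transforms estimated independently for one sub-segment.
    """
    pose = np.eye(4)
    trajectory = [pose.copy()]
    for segment in segment_relative_poses:  # segments in traversal order
        for rel in segment:                 # frame-to-frame transforms
            pose = pose @ rel               # accumulate along the path
            trajectory.append(pose.copy())
    return trajectory
```

Estimating odometry per short sub-segment, then composing, bounds how far drift from any one segment can propagate compared with a single long-range estimate.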