🤖 AI Summary
To address large cumulative drift and degraded accuracy under complex motion trajectories in freehand 3D ultrasound reconstruction, this paper proposes an external-tracking-free multimodal self-supervised framework, MoNetV2. Methodologically, we design a sensor-based temporal and multibranch structure (TMS) that fuses ultrasound images with inertial measurement unit (IMU) data from a velocity perspective; introduce an online multilevel consistency constraint (MCC) that jointly models motion consistency at the scan, path, and patch levels; and devise an online multimodal self-supervised strategy (MSS) to reduce cumulative error and enhance generalization. Evaluated on three large datasets, our method achieves 3.2–5.8% improvements in PSNR/SSIM and reduces cumulative pose error by 41%, demonstrating robustness to variable scanning velocities and diverse acquisition tactics. To the best of our knowledge, it is the first end-to-end framework to explicitly model cross-scale motion consistency without external tracking.
📝 Abstract
Three-dimensional ultrasound (US) aims to provide sonographers with the spatial relationships of anatomical structures and plays a crucial role in clinical diagnosis. Recently, deep-learning-based freehand 3-D US has made significant advances: it reconstructs volumes by estimating transformations between images without external tracking. However, image-only reconstruction struggles to reduce cumulative drift and to further improve reconstruction accuracy, particularly in scenarios involving complex motion trajectories. In this context, we propose an enhanced motion network (MoNetV2) to improve the accuracy and generalizability of reconstruction under diverse scanning velocities and tactics. First, we propose a sensor-based temporal and multibranch structure (TMS) that fuses image and motion information from a velocity perspective to improve image-only reconstruction accuracy. Second, we devise an online multilevel consistency constraint (MCC) that exploits the inherent consistency of scans to handle various scanning velocities and tactics. This constraint combines scan-level velocity consistency (SVC), path-level appearance consistency (PAC), and patch-level motion consistency (PMC) to supervise interframe transformation estimation. Third, we distill an online multimodal self-supervised strategy (MSS) that leverages the correlation between network estimates and motion information to further reduce cumulative errors. Extensive experiments demonstrate that MoNetV2 surpasses existing methods in both reconstruction quality and generalizability across three large datasets.
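To make the scan-level velocity consistency idea concrete, the sketch below shows one plausible form such a constraint could take: penalizing disagreement between the per-frame speed implied by the network's estimated inter-frame translations and the speed reported by the IMU. This is a minimal illustration under our own assumptions, not the paper's implementation; the function name `svc_loss`, the array shapes, and the squared-error form are all hypothetical.

```python
import numpy as np

def svc_loss(est_translations, imu_velocities, dt):
    """Hypothetical scan-level velocity consistency (SVC) sketch:
    compare the speed implied by the network's estimated per-frame
    translations (N x 3, in mm) against the speed measured by the
    IMU (N x 3, in mm/s), given the inter-frame interval dt (s)."""
    est_speed = np.linalg.norm(est_translations, axis=1) / dt  # network-implied speed
    imu_speed = np.linalg.norm(imu_velocities, axis=1)         # sensor-measured speed
    return float(np.mean((est_speed - imu_speed) ** 2))        # mean squared mismatch

# Toy example: four inter-frame translations at 20 fps.
dt = 0.05
est = np.array([[0.5, 0.0, 0.1]] * 4)
consistent_imu = est / dt          # IMU exactly agrees with the network
faster_imu = 1.5 * consistent_imu  # IMU reports a faster sweep

print(svc_loss(est, consistent_imu, dt))  # zero loss when speeds agree
print(svc_loss(est, faster_imu, dt))      # positive loss when they disagree
```

In this form the constraint needs no ground-truth poses, which is what lets it act as an online self-supervision signal; the path-level (PAC) and patch-level (PMC) terms in the paper would add analogous penalties at coarser and finer scales.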