🤖 AI Summary
Medical 3D images suffer from sparse anatomical landmarks and high dimensionality, making it challenging to simultaneously capture fine-grained local details and model global spatial relationships—leading to a trade-off between accuracy and efficiency. To address this, we propose a novel hybrid network architecture that, for the first time, integrates a lightweight hierarchical routing attention mechanism into a 3D CNN backbone. This enables efficient global contextual modeling and adaptive multi-scale feature fusion. The design significantly reduces computational overhead while improving robustness to missing landmarks and complex anatomical variations. Evaluated on public CT datasets, our method achieves state-of-the-art landmark detection accuracy with substantially fewer parameters and lower inference cost. It reduces mean localization error by 12.7% over existing approaches, with particularly pronounced improvements in low signal-to-noise ratio regions and areas exhibiting structural deformities.
📝 Abstract
3D landmark detection is a critical task in medical image analysis, and accurately detecting anatomical landmarks is essential for subsequent medical imaging tasks. However, mainstream deep learning methods in this field struggle to simultaneously capture fine-grained local features and model global spatial relationships, while maintaining a balance between accuracy and computational efficiency. Local feature extraction requires capturing fine-grained anatomical details, while global modeling requires understanding the spatial relationships within complex anatomical structures. The high-dimensional nature of 3D volume further exacerbates these challenges, as landmarks are sparsely distributed, leading to significant computational costs. Therefore, achieving efficient and precise 3D landmark detection remains a pressing challenge in medical image analysis. In this work, We propose a extbf{H}ybrid extbf{3}D extbf{DE}tection extbf{Net}(H3DE-Net), a novel framework that combines CNNs for local feature extraction with a lightweight attention mechanism designed to efficiently capture global dependencies in 3D volumetric data. This mechanism employs a hierarchical routing strategy to reduce computational cost while maintaining global context modeling. To our knowledge, H3DE-Net is the first 3D landmark detection model that integrates such a lightweight attention mechanism with CNNs. Additionally, integrating multi-scale feature fusion further enhances detection accuracy and robustness. Experimental results on a public CT dataset demonstrate that H3DE-Net achieves state-of-the-art(SOTA) performance, significantly improving accuracy and robustness, particularly in scenarios with missing landmarks or complex anatomical variations. We aready open-source our project, including code, data and model weights.