🤖 AI Summary
This work addresses the limitations of existing dental image segmentation methods, which suffer from discontinuous boundaries and weak foreground-background discrimination due to fixed-resolution feature maps, as well as the high computational cost of Transformer-based self-attention mechanisms that hinder efficient processing of high-resolution images. To overcome these challenges, the authors propose a three-stage hierarchical encoder architecture that integrates multi-scale features to preserve fine structural details. The design incorporates bidirectional sequential modeling and a lightweight global context mechanism, significantly reducing computational complexity while enhancing global spatial awareness. Evaluated on the OralVision dataset, the proposed method achieves a 1.1% improvement in mIoU over current state-of-the-art approaches, demonstrating its effectiveness in improving both segmentation continuity and accuracy.
📝 Abstract
Tooth image segmentation is a cornerstone of dental digitization. However, traditional image encoders that rely on fixed-resolution feature maps often produce discontinuous segmentation results and discriminate poorly between target regions and background, owing to insufficient modeling of surrounding and global context. Moreover, Transformer-based self-attention incurs substantial computational overhead due to its quadratic complexity (O(n^2)), making it inefficient for high-resolution dental images. To address these challenges, we introduce a three-stage encoder with hierarchical feature representation that captures scale-adaptive information in dental images. By jointly leveraging low-level details and high-level semantics through cross-scale feature fusion, the model preserves fine structural information while maintaining strong contextual awareness. Furthermore, a bidirectional sequence modeling strategy enhances global spatial context understanding without incurring high computational cost.
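The cross-scale fusion idea can be illustrated with a minimal sketch: features from a hypothetical three-stage pyramid are upsampled to the finest resolution and concatenated along channels, so low-level detail and high-level semantics coexist in one tensor. This is an assumption-based illustration of the general technique, not the paper's exact fusion module; the stage shapes and nearest-neighbour upsampling are placeholders.

```python
import numpy as np

def upsample_nn(x, factor):
    """Nearest-neighbour upsampling of an (H, W, C) feature map."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def fuse_cross_scale(stages):
    """Generic cross-scale fusion sketch: bring every stage's feature
    map to the finest spatial resolution, then concatenate channels."""
    h_max = max(s.shape[0] for s in stages)
    ups = [upsample_nn(s, h_max // s.shape[0]) for s in stages]
    return np.concatenate(ups, axis=-1)

# Hypothetical three-stage pyramid: 32x32x16, 16x16x32, 8x8x64.
rng = np.random.default_rng(0)
pyramid = [rng.normal(size=(32, 32, 16)),
           rng.normal(size=(16, 16, 32)),
           rng.normal(size=(8, 8, 64))]
fused = fuse_cross_scale(pyramid)
print(fused.shape)  # (32, 32, 112)
```

In a real encoder the concatenation would typically be followed by a learned projection; the point here is only that each spatial position in the fused map sees both fine structure and coarse semantics.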
We validate our method on two dental datasets, with experimental results demonstrating its superiority over existing approaches. On the OralVision dataset, our model achieves a 1.1% improvement in mean intersection over union (mIoU) over the previous state of the art.
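The bidirectional sequence modeling mentioned in the abstract can be sketched as a linear-time recurrence over the flattened feature map, run once forward and once backward, giving every position a global receptive field in O(L·C) time rather than the O(L²·C) of full self-attention. The exponential-decay scan below is a simplified stand-in, not the paper's actual mechanism; the decay constant and fusion-by-addition are assumptions.

```python
import numpy as np

def bidirectional_scan(x, decay=0.9):
    """Linear-time bidirectional global-context sketch.

    x: (L, C) sequence of flattened spatial features.
    A simple exponential-decay recurrence is run forward and backward,
    so each position aggregates information from the entire sequence
    in O(L*C) time, versus O(L^2*C) for full self-attention.
    """
    L, C = x.shape
    fwd = np.zeros_like(x)
    bwd = np.zeros_like(x)
    h = np.zeros(C)
    for t in range(L):                      # forward pass
        h = decay * h + (1 - decay) * x[t]
        fwd[t] = h
    h = np.zeros(C)
    for t in range(L - 1, -1, -1):          # backward pass
        h = decay * h + (1 - decay) * x[t]
        bwd[t] = h
    return fwd + bwd                        # fuse both directions

# Flatten a hypothetical H x W x C feature map into a sequence and scan it.
H, W, C = 8, 8, 4
feat = np.random.default_rng(0).normal(size=(H, W, C))
ctx = bidirectional_scan(feat.reshape(H * W, C))
print(ctx.shape)  # (64, 4)
```

Because each scan is a single pass over the sequence, doubling the image resolution quadruples the cost of this operation, whereas quadratic self-attention would multiply it by sixteen.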