🤖 AI Summary
Accurate 3D reconstruction of oral anatomical structures from a single panoramic radiograph (PX) remains challenging due to reliance on CBCT registration, image unwrapping, or prior dental-arch knowledge—leading to high radiation exposure, elevated costs, and inherent depth ambiguity. Method: We propose the first end-to-end Vision Transformer (ViT)-enhanced Neural Beer–Lambert framework. Key innovations include non-overlapping horseshoe-shaped ray sampling (reducing computation by 52%), a ViT-CNN hybrid backbone, and learnable hash-based positional encoding. Crucially, our method requires neither CBCT nor image unwrapping nor prior dental-arch assumptions—only a single PX view. Results: Our approach achieves state-of-the-art performance in PSNR and SSIM, with superior visual fidelity in reconstructed 3D anatomy. It establishes a new paradigm for low-radiation, cost-effective, and clinically deployable dental 3D diagnosis.
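The Beer–Lambert law underlying the framework states that X-ray intensity decays exponentially with the attenuation integrated along a ray; a neural network predicts attenuation at sampled 3D points, and a PX pixel is composed from those samples. A minimal sketch of that compositing step (illustrative only, not the paper's implementation):

```python
import numpy as np

def beer_lambert_pixel(mu_samples: np.ndarray, ds: float, i0: float = 1.0) -> float:
    """Compose one panoramic-pixel intensity from attenuation samples along a ray.

    Beer-Lambert law: I = I0 * exp(-integral of mu ds), discretized here as a
    Riemann sum over equally spaced samples with step ds.
    """
    optical_depth = np.sum(mu_samples) * ds   # discretized line integral of mu
    return i0 * np.exp(-optical_depth)        # transmitted intensity

# Toy usage: 64 samples of constant attenuation 0.05 with step 0.1
mu = np.full(64, 0.05)
print(beer_lambert_pixel(mu, ds=0.1))  # exp(-0.32) ≈ 0.726
```

In the paper's setting, `mu_samples` would come from the network evaluated at the horseshoe-shaped sample points rather than being given directly.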
📝 Abstract
Dental diagnosis relies on two primary imaging modalities: panoramic radiographs (PX) providing 2D oral cavity representations, and Cone-Beam Computed Tomography (CBCT) offering detailed 3D anatomical information. While PX images are cost-effective and accessible, their lack of depth information limits diagnostic accuracy. CBCT addresses this but presents drawbacks including higher costs, increased radiation exposure, and limited accessibility. Existing reconstruction models further complicate the process by requiring CBCT flattening or prior dental-arch information, which is often unavailable clinically. We introduce ViT-NeBLa, a Vision Transformer-based Neural Beer-Lambert model enabling accurate 3D reconstruction directly from a single PX. Our key innovations include: (1) enhancing the NeBLa framework with Vision Transformers for improved reconstruction without requiring CBCT flattening or prior dental-arch information, (2) implementing a novel horseshoe-shaped point-sampling strategy whose non-intersecting rays eliminate the intermediate density aggregation that existing models need to handle intersecting rays, reducing sampling-point computation by 52%, (3) replacing the CNN-based U-Net with a hybrid ViT-CNN architecture for superior global and local feature extraction, and (4) implementing learnable hash positional encoding for a better higher-dimensional representation of 3D sample points than existing Fourier-based dense positional encoding. Experiments demonstrate that ViT-NeBLa significantly outperforms prior state-of-the-art methods both quantitatively and qualitatively, offering a cost-effective, radiation-efficient alternative for enhanced dental diagnostics.
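Innovation (4) replaces Fourier features with a learnable hash-table lookup. A hedged sketch in the spirit of multiresolution hash grids (Instant-NGP style); the table size, hashing primes, and single-level lookup are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

# Large primes commonly used for spatial hashing; illustrative choice.
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_encode(points: np.ndarray, table: np.ndarray, resolution: int) -> np.ndarray:
    """Look up a learnable feature vector for each 3D point at one grid level.

    points:     (N, 3) coordinates in [0, 1)
    table:      (T, F) feature table (trained by backprop in a real model)
    resolution: grid resolution at this level
    """
    grid = np.floor(points * resolution).astype(np.uint64)  # voxel corner indices
    h = np.bitwise_xor.reduce(grid * PRIMES, axis=1)        # spatial hash of indices
    idx = h % np.uint64(table.shape[0])                     # fold into table size
    return table[idx]                                       # (N, F) encoded features

# Toy usage with a small random table
rng = np.random.default_rng(0)
table = rng.normal(size=(2**14, 2)).astype(np.float32)
pts = rng.random((5, 3))
print(hash_encode(pts, table, resolution=32).shape)  # (5, 2)
```

Unlike dense Fourier encoding, which maps every point through fixed sinusoids, the hash table's entries are themselves parameters, so the encoding adapts to the data during training; a full multiresolution version would concatenate lookups across several resolutions and interpolate between voxel corners.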