DINOv3-Guided Cross Fusion Framework for Semantic-aware CT generation from MRI and CBCT

📅 2025-11-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

CNNs have limited global semantic understanding, and Transformers tend to overfit small-scale medical imaging data in MRI/CBCT-to-CT synthesis. To address both problems, this work proposes a semantic-aware cross-modal image translation framework. It pairs a frozen self-supervised DINOv3 Transformer, which extracts robust global semantic representations, with a lightweight trainable CNN encoder-decoder that captures local detail, and combines the two through a learnable cross fusion module that fuses features hierarchically. According to the authors, this is the first use of DINOv3 representations for medical image translation. A Multi-Level DINOv3 Perceptual (MLDP) loss is also proposed to enforce semantic consistency during optimization. Evaluated on the SynthRAD2023 pelvic dataset, the method achieves state-of-the-art performance in both MRI→CT and CBCT→CT synthesis in terms of MS-SSIM, PSNR, and downstream segmentation metrics. The results point to its potential for radiotherapy dose planning and adaptive treatment.

📝 Abstract

Generating synthetic CT images from CBCT or MRI has potential for efficient radiation dose planning and adaptive radiotherapy. However, existing CNN-based models lack global semantic understanding, while Transformers often overfit small medical datasets due to high model capacity and weak inductive bias. To address these limitations, we propose a DINOv3-Guided Cross Fusion (DGCF) framework that integrates a frozen self-supervised DINOv3 Transformer with a trainable CNN encoder-decoder. It hierarchically fuses the Transformer's global representations and the CNN's local features via a learnable cross fusion module, balancing local appearance and contextual representation. Furthermore, we introduce a Multi-Level DINOv3 Perceptual (MLDP) loss that encourages semantic similarity between the synthetic CT and the ground truth in DINOv3's feature space. Experiments on the SynthRAD2023 pelvic dataset demonstrate that DGCF achieves state-of-the-art performance in terms of MS-SSIM, PSNR and segmentation-based metrics on both MRI$\rightarrow$CT and CBCT$\rightarrow$CT translation tasks. To the best of our knowledge, this is the first work to employ DINOv3 representations for medical image translation, highlighting the potential of self-supervised Transformer guidance for semantic-aware CT synthesis. The code is available at https://github.com/HiLab-git/DGCF.
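For readers less familiar with PSNR, one of the metrics reported above, the following is a minimal sketch of the standard definition, 10·log10(data_range² / MSE). The function name `psnr`, the flat-list input, and the `data_range` argument are illustrative assumptions. The abstract does not state the intensity range or clipping used in the paper's evaluation, so this is not the authors' evaluation code.

```python
import math


def psnr(reference, synthetic, data_range):
    """Peak signal-to-noise ratio (dB) between two equally sized flat intensity lists.

    data_range is the assumed span of valid intensities (hypothetical here;
    the evaluation protocol of the paper is not given in this summary).
    """
    if len(reference) != len(synthetic) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all voxels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, synthetic)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)
```

A higher value means the synthetic image is closer to the reference; identical inputs give infinity.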

Problem

Research questions and friction points this paper is trying to address.

Generating synthetic CT from MRI and CBCT for radiotherapy planning

Addressing CNNs' limited global semantic understanding in medical imaging

Solving Transformer overfitting on small medical datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates frozen DINOv3 Transformer with trainable CNN encoder-decoder

Hierarchically fuses global and local features via cross fusion

Uses Multi-Level DINOv3 Perceptual loss for semantic similarity
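The idea behind a multi-level perceptual loss can be sketched as follows. This is an illustration under stated assumptions, not the authors' MLDP formulation: the summary does not give the distance function or the level weights, so the sketch assumes a mean absolute difference per level and uniform weights by default. The feature lists stand in for activations that a frozen feature extractor would produce at several depths. The function name and arguments are hypothetical.

```python
def multi_level_feature_loss(feats_synth, feats_real, weights=None):
    """Weighted sum over levels of the mean absolute feature difference.

    feats_synth, feats_real: one flat list of floats per feature level,
    standing in for features from a frozen extractor (assumption).
    weights: optional per-level weights; uniform if omitted (assumption).
    """
    if len(feats_synth) != len(feats_real):
        raise ValueError("both inputs need the same number of levels")
    if weights is None:
        weights = [1.0] * len(feats_synth)
    if len(weights) != len(feats_synth):
        raise ValueError("one weight per level is required")

    total = 0.0
    for w, fs, fr in zip(weights, feats_synth, feats_real):
        if len(fs) != len(fr) or not fs:
            raise ValueError("features at each level must match in size")
        # Mean absolute difference at this level, scaled by its weight.
        total += w * sum(abs(a - b) for a, b in zip(fs, fr)) / len(fs)
    return total
```

Because the extractor is frozen, a loss of this kind only steers the trainable generator: gradients flow through the features of the synthetic image, while the extractor's own weights stay fixed.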
