DINOv3-Guided Cross Fusion Framework for Semantic-aware CT generation from MRI and CBCT

📅 2025-11-15
🤖 AI Summary
To address the limited global semantic understanding of CNNs and the tendency of Transformers to overfit small-scale medical imaging data in MRI/CBCT-to-CT synthesis, this work proposes a semantic-aware cross-modal image translation framework. It combines a frozen self-supervised DINOv3 Transformer, used here for the first time in medical image translation according to the authors, which extracts robust global semantic representations, with a lightweight trainable CNN encoder-decoder that captures local detail. A learnable cross fusion module merges the two feature streams hierarchically. In addition, a Multi-Level DINOv3 Perceptual (MLDP) loss enforces semantic consistency during optimization. Evaluated on the SynthRAD2023 pelvic dataset, the method achieves state-of-the-art performance in both MRI→CT and CBCT→CT synthesis, with gains in MS-SSIM, PSNR, and downstream segmentation metrics. The results support its use for radiotherapy dose planning and adaptive treatment.

📝 Abstract
Generating synthetic CT images from CBCT or MRI has potential for efficient radiation dose planning and adaptive radiotherapy. However, existing CNN-based models lack global semantic understanding, while Transformers often overfit small medical datasets due to high model capacity and weak inductive bias. To address these limitations, we propose a DINOv3-Guided Cross Fusion (DGCF) framework that integrates a frozen self-supervised DINOv3 Transformer with a trainable CNN encoder-decoder. It hierarchically fuses the global representations of the Transformer with the local features of the CNN via a learnable cross fusion module, balancing local appearance and contextual representation. Furthermore, we introduce a Multi-Level DINOv3 Perceptual (MLDP) loss that encourages semantic similarity between the synthetic CT and the ground truth in DINOv3's feature space. Experiments on the SynthRAD2023 pelvic dataset demonstrate that DGCF achieves state-of-the-art performance in terms of MS-SSIM, PSNR, and segmentation-based metrics on both MRI$\rightarrow$CT and CBCT$\rightarrow$CT translation tasks. To the best of our knowledge, this is the first work to employ DINOv3 representations for medical image translation, highlighting the potential of self-supervised Transformer guidance for semantic-aware CT synthesis. The code is available at https://github.com/HiLab-git/DGCF.
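PSNR, one of the metrics reported above, has a standard definition: PSNR = 10 · log10(R² / MSE), where R is the intensity range of the images and MSE is the mean squared error between the synthetic and reference CT. The following is a minimal sketch of that formula on flattened intensity lists. The function name `psnr` and the `data_range` parameter are illustrative choices, and the intensity range used in the paper's evaluation is not stated here, so it is left as an explicit argument.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], data_range: float) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length intensity lists.

    data_range is the assumed span of valid intensities (max minus min);
    the value used in the paper's evaluation is not given in this summary.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all voxels.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        # Identical images: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)
```

Higher values indicate a closer match to the reference CT. Because the result depends on `data_range`, PSNR values are only comparable when the same range is used.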
Problem

Research questions and friction points this paper is trying to address.

Generating synthetic CT from MRI and CBCT for radiotherapy planning
Addressing the limited semantic understanding of CNNs in medical imaging
Solving Transformer overfitting on small medical datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates frozen DINOv3 Transformer with trainable CNN encoder-decoder
Hierarchically fuses global and local features via cross fusion
Uses Multi-Level DINOv3 Perceptual loss for semantic similarity
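The idea behind a multi-level perceptual loss can be sketched as a weighted sum of feature-space distances taken at several depths of a fixed feature extractor. The example below is a toy illustration only and is not the authors' implementation: the "levels" are hypothetical stand-in functions rather than frozen DINOv3 layers, the mean absolute difference is an assumed distance (the summary does not say which one MLDP uses), and the uniform default weights are likewise an assumption.

```python
from typing import Callable, List, Optional, Sequence

# A level maps a flattened image to a feature vector. In the paper these
# would be features from a frozen DINOv3 model; here they are toy stand-ins.
Level = Callable[[Sequence[float]], List[float]]


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length feature vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("feature vectors must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def multi_level_perceptual_loss(
    pred: Sequence[float],
    target: Sequence[float],
    levels: Sequence[Level],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of per-level feature distances (assumed form, not the paper's)."""
    if weights is None:
        weights = [1.0] * len(levels)  # assumption: equal weight per level
    if len(weights) != len(levels):
        raise ValueError("need exactly one weight per level")
    total = 0.0
    for level, weight in zip(levels, weights):
        # The extractor is fixed: the same function is applied to both images.
        total += weight * mean_abs_diff(level(pred), level(target))
    return total


# Hypothetical fixed "levels", from fine (per pixel) to coarse (global).
def level_pixels(img: Sequence[float]) -> List[float]:
    return list(img)


def level_pair_means(img: Sequence[float]) -> List[float]:
    return [(img[i] + img[i + 1]) / 2 for i in range(0, len(img) - 1, 2)]


def level_global_mean(img: Sequence[float]) -> List[float]:
    return [sum(img) / len(img)]


TOY_LEVELS: List[Level] = [level_pixels, level_pair_means, level_global_mean]
```

With `TOY_LEVELS`, identical inputs give a loss of zero, and a difference that survives at the coarser levels is counted again at each of them. That is the intuition for comparing images at several semantic depths rather than only pixel by pixel.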
Xianhao Zhou
University of Electronic Science and Technology of China
Computer Vision
Jianghao Wu
Monash University
Medical Image Analysis, Computer Vision, Natural Language Processing
Ku Zhao
School of Mechanical and Electrical Engineering, University of Electronic Science and Technology of China, Chengdu, China
Jinlong He
School of Mechanical and Electrical Engineering, University of Electronic Science and Technology of China, Chengdu, China
Huangxuan Zhao
Institute of Artificial Intelligence, School of Computer Science, Wuhan University
Generative AI, Deep Learning, Medical Imaging
Lei Chen
Department of Radiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
Shaoting Zhang
Shanghai AI Lab; SenseTime Research
Medical Image Analysis, Computer Vision, Foundation Models
Guotai Wang
Professor, University of Electronic Science and Technology of China
Medical Image Analysis, Computer Vision, Deep Learning