Vision Transformer-Conditioned UNet for Domain-Adaptive Semantic Segmentation

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the performance gap of Vision Transformers (ViTs) in biomedical semantic segmentation, particularly for sparse, fine-structured, and low signal-to-noise targets. The authors attribute the gap in part to lightweight pixel decoders that may lack local inductive bias. The proposed ViTC-UNet conditions a UNet on frozen, pre-trained ViT representations through learnable tokens and a two-way attention decoder. This design combines the ViT's global visual priors with the UNet's local inductive bias and high-resolution decoding capacity, without end-to-end fine-tuning of the pre-trained ViT, even in cross-domain settings. ViTC-UNet outperforms baseline results on semantic segmentation tasks across MRI and CT modalities, indicating that structure-conditioned UNet decoding can efficiently adapt large-scale visual priors to biomedical segmentation.

📝 Abstract

Semantic segmentation is essential for analysing anatomical features in biomedical research, yet a performance gap remains for Vision Transformers (ViTs) in the field, particularly for sparse, fine-structured, and low signal-to-noise targets. We attribute this challenge in part to the lightweight pixel decoders commonly used in promptable ViT models, which may lack the local inductive bias needed for high-precision biomedical masks. We bridge this gap by introducing ViTC-UNet, which conditions a UNet on frozen pre-trained ViT representations through learnable tokens and a two-way attention decoder. This combines ViT global visual priors with the local inductive bias and high-resolution decoding capacity of UNets, while avoiding end-to-end ViT fine-tuning even in cross-domain settings. ViTC-UNet outperforms baseline results in semantic segmentation tasks across MRI and CT modalities, demonstrating that structure-conditioned UNet decoding can efficiently adapt large-scale visual priors to high-complexity biomedical segmentation.
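
The conditioning idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not specify layer shapes, the number of tokens, or how token information enters the UNet. Everything here is an assumption made for illustration, including the function names (`two_way_block`, `condition_unet_features`), the single-head attention, the random stand-ins for frozen ViT patch embeddings, and the FiLM-style scale-and-shift used to inject the tokens into a UNet feature map.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention."""
    q = queries @ w_q
    k = keys_values @ w_k
    v = keys_values @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def two_way_block(tokens, vit_feats, params):
    """One two-way attention step (hypothetical layout).

    The ViT weights stay frozen; only the tokens and the projection
    matrices in `params` would be trained in a real model.
    """
    # Direction 1: learnable tokens read from the ViT patch embeddings.
    tokens = tokens + cross_attention(tokens, vit_feats, *params["t2i"])
    # Direction 2: a working copy of the patch embeddings reads from the tokens.
    vit_feats = vit_feats + cross_attention(vit_feats, tokens, *params["i2t"])
    return tokens, vit_feats


def condition_unet_features(unet_feats, tokens, w_scale, w_shift):
    """FiLM-style modulation of a (C, H, W) UNet feature map.

    This is a swapped-in conditioning technique chosen for brevity,
    not a mechanism confirmed by the abstract.
    """
    summary = tokens.mean(axis=0)            # (d,)
    scale = 1.0 + summary @ w_scale          # (C,)
    shift = summary @ w_shift                # (C,)
    return unet_feats * scale[:, None, None] + shift[:, None, None]


rng = np.random.default_rng(0)
d, c = 16, 8                                 # embedding width, UNet channels
vit_feats = rng.normal(size=(49, d))         # stand-in for 7x7 frozen ViT patches
tokens = rng.normal(size=(4, d))             # learnable conditioning tokens
params = {
    name: [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
    for name in ("t2i", "i2t")
}

tokens_out, feats_out = two_way_block(tokens, vit_feats, params)
unet_feats = rng.normal(size=(c, 32, 32))    # stand-in for a UNet decoder stage
conditioned = condition_unet_features(
    unet_feats,
    tokens_out,
    rng.normal(size=(d, c)) * 0.1,
    rng.normal(size=(d, c)) * 0.1,
)
```

In this sketch the ViT embeddings act only as inputs, which mirrors the frozen-backbone setting described above: gradients would flow into the tokens and projection matrices, while the convolutional UNet keeps its local inductive bias and high-resolution decoding path.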

Problem

Research questions and friction points this paper is trying to address.

semantic segmentation

Vision Transformers

biomedical imaging

domain adaptation

fine-structured targets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Transformer

UNet

domain-adaptive segmentation