🤖 AI Summary
This work addresses the limited generalization of medical image analysis models across domains and diverse populations, a challenge exacerbated by existing style transfer methods that often suffer from insufficient style diversity or introduce artifacts. To overcome this, the authors propose a novel Vision Transformer (ViT) encoder that, for the first time in ViT architectures, integrates weight-shared self-attention and cross-attention mechanisms. The self-attention module preserves anatomical structure, while the cross-attention module enables artifact-free, instance-level style transfer with high diversity. The approach supports data augmentation during both training and inference. Evaluated on three classification tasks in histopathology and dermatology, the method achieves up to a 13% accuracy improvement over current approaches; with test-time augmentation, performance gains reach 17%. The generated images exhibit high visual fidelity and structural consistency.
📝 Abstract
Deep learning models in medical image analysis often struggle with generalizability across domains and demographic groups due to data heterogeneity and scarcity. Traditional augmentation improves robustness but fails under substantial domain shifts. Recent advances in stylistic augmentation enhance domain generalization by varying image styles but either fall short in style diversity or introduce artifacts into the generated images. To address these limitations, we propose Stylizing ViT, a novel Vision Transformer encoder that utilizes weight-shared attention blocks for both self- and cross-attention. This design allows the same attention block to maintain anatomical consistency through self-attention while performing style transfer via cross-attention. We assess the effectiveness of our method for domain generalization by employing it for data augmentation on three distinct image classification tasks in histopathology and dermatology. Results demonstrate improved robustness (up to +13% accuracy) over the state of the art while generating perceptually convincing images without artifacts. Additionally, we show that Stylizing ViT is effective beyond training, achieving a 17% performance improvement during inference when used for test-time augmentation. The source code is available at https://github.com/sdoerrich97/stylizing-vit.
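The central architectural idea, one attention block whose projection weights are shared between self-attention (structure preservation) and cross-attention (style injection), can be sketched in plain NumPy. This is an illustrative, simplified sketch, not the authors' implementation: the class name `SharedAttention`, the single-head unbatched formulation, and the token shapes are all our assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SharedAttention:
    """A single attention block whose Q/K/V projection weights are reused for
    both self-attention (query = key = value source) and cross-attention
    (query from content tokens, key/value from style tokens).
    Hypothetical sketch; not the paper's actual implementation."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        # One shared set of projection matrices for both attention modes.
        self.Wq = rng.normal(0.0, scale, (dim, dim))
        self.Wk = rng.normal(0.0, scale, (dim, dim))
        self.Wv = rng.normal(0.0, scale, (dim, dim))
        self.dim = dim

    def attend(self, q_tokens, kv_tokens):
        """Scaled dot-product attention with the shared weights."""
        q = q_tokens @ self.Wq
        k = kv_tokens @ self.Wk
        v = kv_tokens @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(self.dim))
        return attn @ v

    def self_attention(self, x):
        # Queries, keys, and values all come from the same image's tokens,
        # which lets the block preserve that image's (anatomical) structure.
        return self.attend(x, x)

    def cross_attention(self, content, style):
        # Queries come from the content image; keys/values from a style
        # image, injecting its appearance into the content tokens.
        return self.attend(content, style)

# Toy usage: 4 content tokens, 6 style tokens, embedding dim 8.
dim = 8
block = SharedAttention(dim)
content = np.random.default_rng(1).normal(size=(4, dim))
style = np.random.default_rng(2).normal(size=(6, dim))
out_self = block.self_attention(content)        # shape (4, 8)
out_cross = block.cross_attention(content, style)  # shape (4, 8)
```

Because the same `Wq`, `Wk`, and `Wv` serve both modes, the block learns one token representation that supports structure-preserving self-attention and instance-level style transfer, which is the weight-sharing property the abstract describes.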