Understanding differences in applying DETR to natural and medical images

📅 2024-05-27
🏛️ Machine Learning for Biomedical Imaging
📈 Citations: 2
Influential: 0
🤖 AI Summary
Transformer-based detectors (e.g., Deformable DETR), designed for natural images, underperform on medical imaging modalities—such as mammography and chest CT—due to their high resolution, small object size, low signal-to-noise ratio, and fine-grained class distinctions. Counterintuitively, standard design complexities (e.g., multi-scale feature fusion and iterative refinement) degrade performance in this domain. Method: Through systematic ablation studies, we identify a “simpler-is-better” principle for medical object detection and propose a lightweight adaptation strategy: single-scale encoder, simplified feature aggregation, direct query initialization, and one-stage bounding box refinement. Contribution/Results: Under standardized evaluation protocols, our lightweight DETR achieves state-of-the-art or comparable accuracy to complex variants on two clinical benchmarks, while improving inference speed by 2.3× and reducing parameters by 68%, challenging the prevailing assumption that Transformer detectors must be architecturally complex for high performance.

📝 Abstract
Natural images depict real-world scenes such as landscapes, animals, and everyday items. Transformer-based detectors, such as the Detection Transformer, have demonstrated strong object detection performance on natural image datasets. These models are typically optimized through complex engineering strategies tailored to the characteristics of natural scenes. However, medical imaging presents unique challenges, such as high resolutions, smaller and fewer regions of interest, and subtle inter-class differences, which differ significantly from natural images. In this study, we evaluated the effectiveness of common design choices in transformer-based detectors when applied to medical imaging. Using two representative datasets, a mammography dataset and a chest CT dataset, we showed that common design choices proposed for natural images, including complex encoder architectures, multi-scale feature fusion, query initialization, and iterative bounding box refinement, fail to improve and can even be detrimental to the object detection performance. In contrast, simpler and shallower architectures often achieve equal or superior results with less computational cost. These findings highlight that standard design practices need to be reconsidered when adapting transformer models to medical imaging, and suggest that simplicity may be more effective than added complexity in this domain. Our model code and weights are publicly available at https://github.com/nyukat/Mammo-DETR
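One way to see why high resolution interacts with multi-scale feature fusion is to count the tokens a transformer encoder would have to process. The sketch below is illustrative only and is not the authors' code: the `DetectorDesign` fields, the backbone strides, the layer counts, and the 2048×1024 image size are all assumptions chosen for the arithmetic, not values taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectorDesign:
    """Hypothetical toggles for the design choices named in the abstract."""
    encoder_layers: int              # depth of the transformer encoder
    feature_strides: tuple           # backbone strides whose maps feed the encoder
    query_initialization: bool       # hypothetical flag, not a real library option
    iterative_box_refinement: bool   # hypothetical flag, not a real library option


def encoder_tokens(height: int, width: int, strides: tuple) -> int:
    """Total encoder tokens: one token per spatial cell of each feature map."""
    return sum(math.ceil(height / s) * math.ceil(width / s) for s in strides)


# Assumed configurations, for illustration only.
complex_design = DetectorDesign(
    encoder_layers=6,
    feature_strides=(8, 16, 32, 64),   # multi-scale feature fusion
    query_initialization=True,
    iterative_box_refinement=True,
)
simple_design = DetectorDesign(
    encoder_layers=1,
    feature_strides=(32,),             # single-scale features
    query_initialization=False,
    iterative_box_refinement=False,
)

HEIGHT, WIDTH = 2048, 1024  # assumed high-resolution input size

tokens_complex = encoder_tokens(HEIGHT, WIDTH, complex_design.feature_strides)
tokens_simple = encoder_tokens(HEIGHT, WIDTH, simple_design.feature_strides)

print(f"multi-scale tokens:  {tokens_complex}")
print(f"single-scale tokens: {tokens_simple}")
print(f"ratio: {tokens_complex / tokens_simple:.2f}x")
```

Under these assumed settings the multi-scale configuration feeds roughly twenty times as many tokens to the encoder as the single-scale one. That is only a count of input tokens, not a measurement of accuracy or runtime, but it gives a sense of the computational cost the abstract refers to.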
Problem

Research questions and friction points this paper is trying to address.

Evaluating DETR's design choices for medical vs natural images
Assessing transformer adaptations for large medical image sizes
Identifying simpler architectures for better medical object detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simpler architectures outperform complex designs
Reevaluates transformer models for medical imaging
Focuses on large image sizes and subtle inter-class differences
Yanqi Xu
Center for Data Science, New York University, New York, NY, USA
Yiqiu Shen
New York University
Deep Learning, Medical Image Processing, Weakly Supervised Learning, Interpretability
C. Fernandez‐Granda
Center for Data Science, New York University, New York, NY, USA
L. Heacock
NYU Langone Health, New York, NY, USA
Krzysztof J. Geras
New York University
Machine Learning, Deep Learning, Artificial Intelligence, Medical Imaging