A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of few-shot object detection in cross-domain scenarios, namely severe data scarcity, unstable optimization, and poor generalization, by proposing a hybrid ensemble decoder combined with a unified progressive fine-tuning framework. The decoder consists of a shared hierarchical layer followed by multiple parallel decoder branches. Each branch uses denoising queries that are either inherited from the shared layer or newly initialized, which encourages prediction diversity, and the branch predictions are then ensembled. The design fully exploits pretrained weights and introduces no additional parameters. Training is stabilized by a plateau-aware learning rate schedule, and the method achieves strong few-shot adaptation without complex data augmentation or extensive hyperparameter tuning. On the RF100-VL benchmark under the 10-shot setting, it attains an average of 41.9 mAP, outperforming SAM3 (35.7 mAP), and the proposed modules improve robustness to out-of-distribution samples on a mixed-domain test set built from CD-FSOD.
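The summary does not say how predictions from the parallel branches are combined. As a minimal sketch only, assuming the simplest possible rule (pool all detections, then apply greedy non-maximum suppression), the ensembling step could look like the following. The names `ensemble_branches`, `iou`, and `iou_thresh` are hypothetical and are not taken from the paper or its repository.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[Box, float]            # (box, confidence score)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def ensemble_branches(
    branch_outputs: List[List[Detection]], iou_thresh: float = 0.5
) -> List[Detection]:
    """Assumed ensembling rule: pool every branch's detections, then greedy NMS."""
    # Highest-scoring detections are considered first.
    pooled = sorted(
        (det for branch in branch_outputs for det in branch),
        key=lambda det: det[1],
        reverse=True,
    )
    kept: List[Detection] = []
    for box, score in pooled:
        # Keep a box only if it does not heavily overlap an already kept one.
        if all(iou(box, kept_box) <= iou_thresh for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Two hypothetical branches: they agree on one object, and only branch_b finds a second.
branch_a = [((10.0, 10.0, 50.0, 50.0), 0.90)]
branch_b = [((12.0, 11.0, 51.0, 49.0), 0.80), ((100.0, 100.0, 140.0, 140.0), 0.60)]
merged = ensemble_branches([branch_a, branch_b])
```

In this toy case the two overlapping boxes collapse into the higher-scoring one, while the object found by only one branch survives, which is the kind of benefit diverse branches are meant to provide.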
📝 Abstract
Few-shot object detection (FSOD) is challenging due to unstable optimization and limited generalization arising from the scarcity of training samples. To address these issues, we propose a hybrid ensemble decoder that enhances generalization during fine-tuning. Inspired by ensemble learning, the decoder comprises a shared hierarchical layer followed by multiple parallel decoder branches, where each branch employs denoising queries either inherited from the shared layer or newly initialized to encourage prediction diversity. This design fully exploits pretrained weights without introducing additional parameters, and the resulting diverse predictions can be effectively ensembled to improve generalization. We further leverage a unified progressive fine-tuning framework with a plateau-aware learning rate schedule, which stabilizes optimization and achieves strong few-shot adaptation without complex data augmentations or extensive hyperparameter tuning. Extensive experiments on CD-FSOD, ODinW-13, and RF100-VL validate the effectiveness of our approach. Notably, on RF100-VL, which includes 100 datasets across diverse domains, our method achieves an average performance of 41.9 in the 10-shot setting, significantly outperforming the recent approach SAM3, which obtains 35.7. We further construct a mixed-domain test set from CD-FSOD to evaluate robustness to out-of-distribution (OOD) samples, showing that our proposed modules lead to clear improvements. These results highlight the effectiveness, generalization, and robustness of the proposed method. Code is available at: https://github.com/Intellindust-AI-Lab/FT-FSOD.
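The abstract names a plateau-aware learning rate schedule but does not define it. The sketch below shows the generic reduce-on-plateau idea, assuming the rate is multiplied by a fixed factor whenever a validation metric fails to improve for a set number of evaluations. The class `PlateauLR` and its parameters (`factor`, `patience`, `min_lr`) are illustrative assumptions, not the authors' schedule.

```python
class PlateauLR:
    """Generic reduce-on-plateau rule (an assumption, not the paper's exact schedule)."""

    def __init__(self, lr: float, factor: float = 0.5,
                 patience: int = 2, min_lr: float = 1e-6) -> None:
        self.lr = lr
        self.factor = factor      # multiplicative decay applied on a plateau
        self.patience = patience  # evaluations without improvement before decaying
        self.min_lr = min_lr      # lower bound on the learning rate
        self.best = float("-inf")
        self.stale = 0            # consecutive evaluations without improvement

    def step(self, metric: float) -> float:
        """Update with a higher-is-better metric and return the current rate."""
        if metric > self.best:
            self.best = metric
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.stale = 0
        return self.lr


# Hypothetical validation scores from successive evaluations during fine-tuning.
scheduler = PlateauLR(lr=1e-4, factor=0.5, patience=2)
scores = [30.0, 32.0, 32.0, 31.5, 33.0, 32.9, 32.8]
lr_history = [scheduler.step(score) for score in scores]
```

With only a handful of training images per class, a fixed decay schedule is hard to tune in advance; reacting to the observed metric is one plausible way to avoid that tuning, which matches the abstract's stated goal of limiting hyperparameter search.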
Problem

Research questions and friction points this paper is trying to address.

Few-shot object detection
Cross-domain
Generalization
Optimization stability
Out-of-distribution robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid ensemble decoder
Parallel decoder branches
Progressive fine-tuning
Plateau-aware learning rate
Cross-domain few-shot object detection