🤖 AI Summary
This work addresses the limitations of text-prompt-only approaches in cross-domain few-shot object detection (CD-FSOD), which often suffer from insufficient target-domain visual detail and suboptimal localization accuracy. To overcome this, the authors propose LMP, a dual-branch detector that integrates dynamically generated visual prototypes with textual prompts in parallel branches. Specifically, LMP constructs domain-specific visual prototypes by aggregating support region-of-interest (RoI) features and sharpens discriminability with hard-negative prototypes generated from jittered bounding boxes. The model jointly trains text-guided and vision-guided branches and fuses their predictions at inference. Evaluated on six cross-domain benchmarks under 1/5/10-shot settings, LMP achieves state-of-the-art or highly competitive mAP, effectively balancing open-vocabulary semantic understanding with fine-grained, domain-adaptive visual modeling.
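To make the prototype step concrete, here is a minimal PyTorch sketch of the two operations the summary describes: averaging support RoI features into per-class prototypes, and perturbing boxes to sample hard-negative regions. All function names and the jitter magnitude are illustrative assumptions, not taken from the paper's implementation, and it assumes pooled RoI embeddings have already been extracted.

```python
# Hedged sketch of Visual Prototype Construction: class-mean prototypes from
# support RoI features, plus jittered boxes for hard-negative sampling.
# Names (build_prototypes, jitter_boxes) and the 0.2 jitter scale are
# assumptions for illustration, not the paper's actual code.
import torch
import torch.nn.functional as F

def build_prototypes(roi_feats: torch.Tensor, labels: torch.Tensor,
                     num_classes: int) -> torch.Tensor:
    """Average support RoI embeddings per class into one prototype per class.

    roi_feats: (N, D) pooled RoI embeddings from the support images.
    labels:    (N,)   class index of each support box.
    Returns:   (num_classes, D) L2-normalized class prototypes.
    """
    protos = torch.zeros(num_classes, roi_feats.size(1),
                         device=roi_feats.device)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = roi_feats[mask].mean(dim=0)
    return F.normalize(protos, dim=-1)

def jitter_boxes(boxes: torch.Tensor, scale: float = 0.2) -> torch.Tensor:
    """Perturb (x1, y1, x2, y2) boxes to carve out hard-negative regions.

    boxes: (M, 4); each coordinate is shifted by up to `scale` of the
    box's width/height, yielding near-miss regions around true objects.
    """
    wh = boxes[:, 2:] - boxes[:, :2]                      # (M, 2) sizes
    noise = (torch.rand_like(boxes) * 2 - 1) * scale * wh.repeat(1, 2)
    return boxes + noise
```

Class-mean pooling is the simplest plausible aggregator here; the paper's actual aggregation rule and jitter schedule may differ.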
📝 Abstract
Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel classes in unseen target domains given only a few labeled examples. While open-vocabulary detectors built on vision-language models (VLMs) transfer well, they depend almost entirely on text prompts, which encode domain-invariant semantics but miss the domain-specific visual information needed for precise localization under few-shot supervision. We propose a dual-branch detector that Learns Multi-modal Prototypes, dubbed LMP, by coupling textual guidance with visual exemplars drawn from the target domain. A Visual Prototype Construction module aggregates class-level prototypes from support RoIs and dynamically generates hard-negative prototypes in query images via jittered boxes, capturing distractors and visually similar backgrounds. In the vision-guided branch, we inject these prototypes into the detection pipeline, using components mirrored from the text branch as the starting point for training, while a parallel text-guided branch preserves open-vocabulary semantics. The two branches are trained jointly and ensembled at inference, combining semantic abstraction with domain-adaptive detail. On six cross-domain benchmarks under standard 1/5/10-shot settings, our method achieves state-of-the-art or highly competitive mAP.
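As a rough illustration of the inference-time ensemble, the sketch below fuses per-box class probabilities from the two branches with a weighted geometric mean. The fusion rule and the weight `alpha` are assumptions for illustration only; the abstract states that the branch predictions are ensembled but does not specify how.

```python
# Hedged sketch of branch fusion at inference: combine text-guided and
# vision-guided class probabilities per detected box. The geometric-mean
# rule and `alpha` weight are assumptions, not the paper's stated method.
import torch

def fuse_scores(text_scores: torch.Tensor, vis_scores: torch.Tensor,
                alpha: float = 0.5) -> torch.Tensor:
    """Weighted geometric mean of the two branches' class probabilities.

    text_scores, vis_scores: (num_boxes, num_classes) probabilities in [0, 1].
    alpha: weight on the vision-guided branch (0 = text only, 1 = vision only).
    """
    text = text_scores.clamp(min=1e-6)   # avoid 0 ** power underflow issues
    vis = vis_scores.clamp(min=1e-6)
    return text ** (1.0 - alpha) * vis ** alpha
```

A geometric mean rewards boxes on which both branches agree, which matches the stated goal of combining open-vocabulary semantics with domain-adaptive visual evidence, but a learned or score-level weighted sum would be an equally plausible fusion choice.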