Boosting Single-domain Generalized Object Detection via Vision-Language Knowledge Interaction

📅 2025-04-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Single-domain generalized object detection (S-DGOD) aims to train a detector on data from a single source domain while generalizing robustly to multiple unseen target domains (e.g., varying weather or illumination conditions). However, existing approaches rely on coarse-grained vision-language knowledge, which limits their ability to learn domain-invariant region-level features. To address this, the authors propose a fine-grained cross-modal vision-language interaction framework built on two components: a cross-modal region-aware feature interaction mechanism and a cross-domain proposal refining and mixing strategy, using fine-grained text-image alignment to drive region-level generalizable representation learning. The method combines vision-language model (VLM) fine-tuning, cross-modal attention, region-level contrastive learning, and dynamic proposal alignment with mixing-based augmentation. On the Cityscapes-C and DWD benchmarks it improves mean performance under corruption (mPC) by +8.8% and +7.9% over baselines, respectively, establishing new state-of-the-art results.
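The paper's code is not reproduced here, but the cross-modal attention component can be illustrated with a minimal PyTorch-style sketch: RoI-pooled region features act as queries over token-level text embeddings from a frozen VLM text encoder, so each region is refined by the language context most relevant to it. All module and tensor names below are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalRegionInteraction(nn.Module):
    """Sketch of region-to-text cross-attention (names are illustrative)."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # Region features serve as queries; text token embeddings as keys/values.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, region_feats, text_feats):
        # region_feats: (B, R, dim) RoI-pooled features per image
        # text_feats:   (B, T, dim) token-level embeddings from a text encoder
        attended, _ = self.attn(region_feats, text_feats, text_feats)
        # Residual fusion preserves the original visual evidence.
        return self.norm(region_feats + attended)
```

The residual connection is a common design choice in such fusion modules: it lets the detector fall back on purely visual features when the text context adds little.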

📝 Abstract
Single-Domain Generalized Object Detection (S-DGOD) aims to train an object detector on a single source domain while generalizing well to diverse unseen target domains, making it suitable for multimedia applications that involve various domain shifts, such as intelligent video surveillance and VR/AR technologies. With the success of large-scale Vision-Language Models, recent S-DGOD approaches exploit pre-trained vision-language knowledge to guide invariant feature learning across visual domains. However, the utilized knowledge remains at a coarse-grained level (e.g., the textual description of adverse weather paired with the image) and serves as an implicit regularization for guidance, struggling to learn accurate region- and object-level features in varying domains. In this work, we propose a new cross-modal feature learning method, which can capture generalized and discriminative regional features for S-DGOD tasks. The core of our method is the mechanism of Cross-modal and Region-aware Feature Interaction, which simultaneously learns both inter-modal and intra-modal regional invariance through dynamic interactions between fine-grained textual and visual features. Moreover, we design a simple but effective strategy called Cross-domain Proposal Refining and Mixing, which aligns the position of region proposals across multiple domains and diversifies them, enhancing the localization ability of detectors in unseen scenarios. Our method achieves new state-of-the-art results on S-DGOD benchmark datasets, with improvements of +8.8% mPC on Cityscapes-C and +7.9% mPC on DWD over baselines, demonstrating its efficacy.
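As a rough illustration of the Cross-domain Proposal Refining and Mixing idea described in the abstract, the sketch below aligns region proposals between a source image and a domain-augmented view of it (matching boxes by IoU and averaging the matched pairs), then mixes in a random subset of unmatched augmented proposals to diversify training regions. This is an assumption-laden reading of the abstract, not the authors' algorithm; the function name, thresholds, and mixing rule are all hypothetical.

```python
import torch
from torchvision.ops import box_iou

def refine_and_mix_proposals(props_src, props_aug, iou_thresh=0.5, mix_ratio=0.5):
    """Hypothetical sketch: align proposals between a source image and its
    domain-augmented view, then mix them to diversify training regions.

    props_src: (N, 4) boxes in (x1, y1, x2, y2) from the source view
    props_aug: (M, 4) boxes from the augmented view of the same image
    """
    iou = box_iou(props_src, props_aug)          # (N, M) pairwise IoU
    best_iou, best_idx = iou.max(dim=1)          # closest augmented box per source box
    matched = best_iou >= iou_thresh

    # "Refine": average matched box coordinates across the two domain views.
    refined = props_src.clone()
    refined[matched] = 0.5 * (props_src[matched] + props_aug[best_idx[matched]])

    # "Mix": append a random subset of unmatched augmented boxes for diversity.
    unmatched = torch.ones(props_aug.size(0), dtype=torch.bool)
    unmatched[best_idx[matched]] = False
    extra = props_aug[unmatched]
    keep = torch.rand(extra.size(0)) < mix_ratio
    return torch.cat([refined, extra[keep]], dim=0)
```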
Problem

Research questions and friction points this paper is trying to address.

Enhancing single-domain object detection generalization to unseen domains
Improving fine-grained region- and object-level feature learning across domains
Aligning and diversifying cross-domain proposals for better localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal and Region-aware Feature Interaction
Cross-domain Proposal Refining and Mixing
Dynamic fine-grained textual-visual feature learning (a contrastive-alignment sketch follows below)
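The region-level contrastive learning mentioned in the summary could plausibly take the form below: an InfoNCE-style loss that pulls each region feature toward the text embedding of its ground-truth class and pushes it away from other classes' embeddings, encouraging region features from any domain to cluster around domain-invariant text anchors. A minimal sketch under those assumptions; the function and argument names are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def region_text_contrastive_loss(region_feats, text_embeds, labels, tau=0.07):
    """InfoNCE-style region-to-text alignment (illustrative names).

    region_feats: (R, D) pooled region features across all domains in the batch
    text_embeds:  (C, D) per-class text embeddings from a frozen text encoder
    labels:       (R,)   ground-truth class index for each region
    """
    region_feats = F.normalize(region_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Cosine similarity of every region to every class text, temperature-scaled.
    logits = region_feats @ text_embeds.t() / tau    # (R, C)
    # Cross-entropy against the true class realizes the InfoNCE objective.
    return F.cross_entropy(logits, labels)
```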
👥 Authors
Xiaoran Xu (USF)
Jiangang Yang (Institute of Microelectronics of the Chinese Academy of Sciences, Beijing, China)
Wenyue Chong (School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences, Beijing, China)
Wenhui Shi (Institute of Microelectronics of the Chinese Academy of Sciences, Beijing, China)
Shichu Sun (School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences, Beijing, China)
Jing Xing (Lingang Laboratory)
Jian Liu (Institute of Microelectronics of the Chinese Academy of Sciences, Beijing, China)