VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection

📅 2026-04-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the performance degradation of object detectors that are trained on a single source domain and deployed in unseen environments, such as different weather or lighting conditions. The resulting domain shift shows up mainly as an increased miss rate. To mitigate this, the authors propose VFM⁴SDG, a framework that uses a frozen vision foundation model (VFM) as a cross-domain stability prior at two stages of the detector. During encoding, relational prior distillation strengthens the modeling of object-background and inter-instance relationships; during decoding, semantic-contextual priors (category-level semantic prototypes and global visual context) refine the query representations. Built on DETR-style detectors, VFM⁴SDG outperforms existing state-of-the-art methods on standard SDGOD benchmarks, improving semantic recognition and localization stability under unseen domain shifts.
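One way to picture the encoding-stage idea is as matching pairwise token relations between the detector's encoder and the frozen VFM. The sketch below is only an illustration under stated assumptions: the summary does not specify how relations or the loss are defined, so the cosine-similarity relation matrix, the mean-squared-error objective, and the function names `cosine_relation_matrix` and `relation_distillation_loss` are all hypothetical choices, not the authors' implementation.

```python
import math


def cosine_relation_matrix(tokens):
    """Return the N x N matrix of pairwise cosine similarities between tokens.

    `tokens` is a list of N feature vectors (lists of floats).
    """
    # Guard against zero vectors: a zero norm is replaced by 1.0.
    norms = [math.sqrt(sum(x * x for x in t)) or 1.0 for t in tokens]
    return [
        [
            sum(a * b for a, b in zip(ti, tj)) / (ni * nj)
            for tj, nj in zip(tokens, norms)
        ]
        for ti, ni in zip(tokens, norms)
    ]


def relation_distillation_loss(student_tokens, teacher_tokens):
    """Mean squared difference between student and teacher relation matrices.

    Assumption: the student (detector encoder) and the teacher (frozen VFM)
    produce the same number of tokens at matching spatial locations.
    Their feature dimensions may differ, because only the N x N relation
    matrices are compared, never the raw features.
    """
    if len(student_tokens) != len(teacher_tokens):
        raise ValueError("student and teacher need the same number of tokens")
    student_rel = cosine_relation_matrix(student_tokens)
    teacher_rel = cosine_relation_matrix(teacher_tokens)
    n = len(student_rel)
    total = sum(
        (student_rel[i][j] - teacher_rel[i][j]) ** 2
        for i in range(n)
        for j in range(n)
    )
    return total / (n * n)


if __name__ == "__main__":
    student = [[1.0, 0.0], [0.0, 1.0]]            # two orthogonal tokens
    teacher = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]  # two identical tokens
    print(relation_distillation_loss(student, teacher))
```

Comparing relation matrices rather than raw features means the detector is not forced to copy the VFM's feature space, only its pattern of similarities between image regions. In a real training setup the teacher would be kept frozen and this term would be added to the usual detection loss.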

📝 Abstract

In real-world scenarios, continual changes in weather, illumination, and imaging conditions cause significant domain shifts, leading detectors trained on a single source domain to degrade severely in unseen environments. Existing single-domain generalized object detection (SDGOD) methods mainly rely on data augmentation or domain-invariant representation learning, but pay limited attention to the detector's internal mechanisms, which limits their effectiveness under complex domain shifts. Through analytical experiments, we find that performance degradation is dominated by an increase in missed detections, which fundamentally arises from reduced cross-domain stability of the detector: object-background and inter-instance relations become less stable in the encoding stage, while semantic-spatial alignment of query representations becomes harder to maintain in the decoding stage. To this end, we propose VFM$^{4}$SDG, a dual-prior learning framework for SDGOD, which introduces a frozen vision foundation model (VFM) as a transferable cross-domain stability prior into detector representation learning and query modeling. In the encoding stage, we propose Cross-domain Stable Relational Prior Distillation to enhance the robustness of object-background and inter-instance relational modeling. In the decoding stage, we propose Semantic-Contextual Prior-based Query Enhancement, which injects category-level semantic prototypes and global visual context into queries to improve their semantic recognition and spatial localization stability in unseen domains. Extensive experiments show that the proposed method consistently outperforms existing SOTA methods on standard SDGOD benchmarks when applied to two mainstream DETR-based detectors, demonstrating its effectiveness, robustness, and generality.
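The decoding-stage idea of injecting category-level semantic prototypes and global visual context into queries can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the abstract does not say how the injection is performed, so the scaled dot-product attention over prototypes, the mean-pooled global context, the additive fusion, and the names `softmax` and `enhance_queries` are hypothetical and not the authors' design.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def enhance_queries(queries, prototypes, vfm_tokens):
    """Add a semantic term and a global-context term to every query.

    Assumptions (not taken from the paper):
      * queries, prototypes, and vfm_tokens share one feature dimension;
      * each query attends to the category prototypes with scaled
        dot-product attention;
      * global visual context is the mean of the frozen VFM's tokens;
      * the two priors are fused with the query by simple addition.
    """
    dim = len(queries[0])
    # Global visual context: average of all VFM tokens, one value per channel.
    global_ctx = [
        sum(tok[d] for tok in vfm_tokens) / len(vfm_tokens) for d in range(dim)
    ]
    enhanced = []
    for q in queries:
        # Attention weights of this query over the category prototypes.
        scores = [
            sum(a * b for a, b in zip(q, p)) / math.sqrt(dim) for p in prototypes
        ]
        weights = softmax(scores)
        # Weighted mix of prototypes: the semantic prior for this query.
        semantic = [
            sum(w * p[d] for w, p in zip(weights, prototypes)) for d in range(dim)
        ]
        enhanced.append([q[d] + semantic[d] + global_ctx[d] for d in range(dim)])
    return enhanced


if __name__ == "__main__":
    queries = [[0.0, 0.0]]
    prototypes = [[1.0, 0.0], [0.0, 1.0]]  # one prototype per category
    vfm_tokens = [[2.0, 2.0], [4.0, 4.0]]
    print(enhance_queries(queries, prototypes, vfm_tokens))
```

In a DETR-style decoder the enhanced queries would replace the original ones before cross-attention. A learned projection or gating would likely be used instead of plain addition, but that detail is not stated in the abstract.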

Problem

Research questions and friction points this paper is trying to address.

domain shift

single-domain generalized object detection

cross-domain stability

missed detections

object detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Foundation Model

Single-Domain Generalized Object Detection

Cross-domain Stability