Language-Driven Dual Style Mixing for Single-Domain Generalized Object Detection

📅 2025-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses the challenge of generalizing an object detector trained on a single source domain to multiple unseen target domains. It proposes Language-Driven Dual Style Mixing (LDDS), a model-agnostic approach that improves cross-domain robustness without requiring the detector backbone to match the image encoder of a vision-language model (VLM). LDDS constructs prompts that transfer style semantics from the VLM to an image translation network, which generates style-diversified images. It then applies (1) image-level style mixing between the diversified images and the source images, and (2) feature-level style mixing in a double-pipeline manner, compatible with one-stage, two-stage, and transformer-based detectors. Experiments on real-to-cartoon and normal-to-adverse-weather benchmarks are reported to demonstrate the method's effectiveness. The authors state that the source code and pre-trained models will be made available at https://github.com/qinhongda8/LDDS.

📝 Abstract

Generalizing an object detector trained on a single domain to multiple unseen domains is a challenging task. Existing methods typically introduce image or feature augmentation to diversify the source domain and improve the robustness of the detector. Vision-Language Model (VLM)-based augmentation techniques have proven effective, but they require the detector's backbone to have the same structure as the image encoder of the VLM, which limits the choice of detector framework. To address this problem, we propose Language-Driven Dual Style Mixing (LDDS) for single-domain generalization, which diversifies the source domain by fully utilizing the semantic information of the VLM. Specifically, we first construct prompts to transfer style semantics embedded in the VLM to an image translation network. This facilitates the generation of style-diversified images with explicit semantic information. Then, we propose image-level style mixing between the diversified images and the source domain images. This effectively mines the semantic information for image augmentation without relying on specific augmentation selections. Finally, we propose feature-level style mixing in a double-pipeline manner, which makes feature augmentation model-agnostic so that it works seamlessly with mainstream detector frameworks, including one-stage, two-stage, and transformer-based detectors. Extensive experiments demonstrate the effectiveness of our approach across various benchmark datasets, including real-to-cartoon and normal-to-adverse-weather tasks. The source code and pre-trained models will be publicly available at https://github.com/qinhongda8/LDDS.
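
The abstract does not spell out how styles are mixed, so the following is only a rough sketch of one common way to do feature-level style mixing: interpolating channel-wise mean and standard deviation between a source feature map and the feature map of its style-diversified counterpart (in the spirit of AdaIN/MixStyle). The function names `channel_stats` and `mix_style`, the `(C, H, W)` layout, and the Beta-sampled mixing weight are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def channel_stats(x, eps=1e-6):
    """Per-channel mean and std of a (C, H, W) array (hypothetical helper)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = np.sqrt(x.var(axis=(1, 2), keepdims=True) + eps)
    return mean, std

def mix_style(source, styled, lam):
    """Re-style `source` using statistics interpolated between source and styled.

    lam = 1.0 keeps the source style; lam = 0.0 adopts the styled statistics.
    This is an assumed mixing rule, not the one defined in the paper.
    """
    mu_s, sig_s = channel_stats(source)
    mu_t, sig_t = channel_stats(styled)
    mu_mix = lam * mu_s + (1.0 - lam) * mu_t
    sig_mix = lam * sig_s + (1.0 - lam) * sig_t
    normalized = (source - mu_s) / sig_s  # remove the source style
    return normalized * sig_mix + mu_mix  # apply the mixed style

# Toy feature maps standing in for the two pipelines' activations.
rng = np.random.default_rng(0)
source_feat = rng.normal(0.0, 1.0, size=(4, 8, 8))
styled_feat = rng.normal(2.0, 3.0, size=(4, 8, 8))

lam = rng.beta(0.1, 0.1)  # assumed sampling scheme for the mixing weight
mixed_feat = mix_style(source_feat, styled_feat, lam)
```

Because the sketch only touches per-channel statistics of an intermediate tensor, it does not depend on the detector's architecture, which is the property the abstract refers to as model-agnostic. The same function could in principle be applied to image tensors for an image-level variant, again as an assumption.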

Problem

Research questions and friction points this paper is trying to address.

Generalizing object detection from single to multiple unseen domains

Overcoming VLM-based augmentation's detector framework limitations

Enhancing robustness via language-driven image and feature style mixing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses VLM semantic information, via constructed prompts, to guide style transfer

Mixes image-level styles for diverse augmentation

Applies feature-level style mixing model-agnostically

🔎 Similar Papers

GOOD: Towards Domain Generalized Orientated Object Detection