🤖 AI Summary
In unsupervised cross-domain image retrieval (UCIR), semantic object features are entangled with domain-specific style features, which impedes effective cross-domain matching. To address this, we propose the first diffusion model-based feature disentanglement framework for UCIR. Our method explicitly decouples semantic object representations from domain style using text-to-image generation priors, incorporates a cross-domain mutual nearest neighbor mechanism for progressive feature alignment, and improves discriminability via unsupervised contrastive learning. The core contribution is the integration of generative modeling, specifically diffusion-based priors, into UCIR to enable disentanglement-driven cross-domain semantic alignment. Extensive experiments on three standard benchmarks comprising 13 diverse domains demonstrate substantial improvements over state-of-the-art methods, validating both the efficacy and generalizability of generative priors in unsupervised cross-domain retrieval.
📝 Abstract
Unsupervised cross-domain image retrieval (UCIR) aims to retrieve images of the same category across diverse domains without relying on annotations. Existing UCIR methods, which align cross-domain features of the entire image, often struggle with the domain gap because the object features critical for retrieval are frequently entangled with domain-specific styles. To address this challenge, we propose DUDE, a novel UCIR method built on feature disentanglement. In brief, DUDE leverages a text-to-image generative model to disentangle object features from domain-specific styles, thus facilitating semantic image retrieval. To further achieve reliable alignment of the disentangled object features, DUDE progressively aligns mutual neighbors, first within domains and then across domains. Extensive experiments demonstrate that DUDE achieves state-of-the-art performance on three benchmark datasets spanning 13 domains. The code will be released.
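The mutual-neighbor idea mentioned above can be illustrated with a minimal NumPy sketch. This is not the DUDE implementation: the function names (`l2_normalize`, `mutual_nearest_neighbors`), the use of cosine similarity, and the `k` parameter are all assumptions made for illustration, and the progressive within-domain to cross-domain schedule is not modeled. The sketch only shows how a pair of features from two domains could be selected when each is among the other's top-k neighbors.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def mutual_nearest_neighbors(feats_a, feats_b, k=1):
    """Hypothetical helper: return (i, j) pairs where feats_b[j] is among the
    top-k neighbors of feats_a[i] AND feats_a[i] is among the top-k neighbors
    of feats_b[j], using cosine similarity (an assumption, not from the paper).
    """
    a = l2_normalize(np.asarray(feats_a, dtype=float))
    b = l2_normalize(np.asarray(feats_b, dtype=float))
    sim = a @ b.T  # shape (n_a, n_b)
    n_a, n_b = sim.shape

    # Top-k indices in each direction (k is clipped to the available count).
    top_ab = np.argsort(-sim, axis=1)[:, : min(k, n_b)]    # per row of A: indices into B
    top_ba = np.argsort(-sim.T, axis=1)[:, : min(k, n_a)]  # per row of B: indices into A

    # Boolean masks over the (n_a, n_b) grid for each direction.
    a_to_b = np.zeros((n_a, n_b), dtype=bool)
    a_to_b[np.arange(n_a)[:, None], top_ab] = True
    b_to_a = np.zeros((n_a, n_b), dtype=bool)
    b_to_a[top_ba, np.arange(n_b)[:, None]] = True

    # A pair is kept only if both directions agree.
    mutual = a_to_b & b_to_a
    return [(int(i), int(j)) for i, j in np.argwhere(mutual)]


if __name__ == "__main__":
    # Toy features standing in for disentangled object features of two domains.
    domain_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    domain_b = np.array([[0.9, 0.1], [0.1, 0.9]])
    print(mutual_nearest_neighbors(domain_a, domain_b, k=1))
```

In this toy example the third feature of `domain_a` is not paired, because neither feature in `domain_b` ranks it as its nearest neighbor. Requiring agreement in both directions is what makes such pairs a more conservative alignment signal than one-directional nearest neighbors.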