🤖 AI Summary
To address the large vision-language modality gap and the reliance on triplet annotations in zero-shot composed image retrieval (ZS-CIR), this paper proposes a generative cross-modal alignment framework based on diffusion models. The method jointly embeds the visual and linguistic modalities into a unified latent space and introduces a multimodal fusion feature editing strategy that enables fine-grained, cooperative editing of reference images and textual modifications. It further adds a lightweight Control-Adapter module that requires only 200K synthetic samples for fine-tuning, dramatically improving data efficiency while achieving state-of-the-art performance. Extensive experiments demonstrate consistent gains over existing zero-shot methods on three benchmarks: CIRR, FashionIQ, and CIRCO. Ablation studies and interpretability visualizations further confirm the soundness and effectiveness of the editing mechanism.
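The paper's actual architecture is not reproduced here, but the two ideas in the summary (fusing reference-image and modification-text features in a joint latent space, then conditioning a frozen diffusion backbone through a small adapter) can be sketched in PyTorch. This is a minimal illustrative sketch, not the authors' code: the module names (`FusionEditor`, `ControlAdapter`), the cross-attention fusion design, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

EMB_DIM = 768  # assumed width of a CLIP-like joint embedding space


class FusionEditor(nn.Module):
    """Fuses a reference-image embedding with modification-text tokens."""

    def __init__(self, dim: int = EMB_DIM, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, img_emb: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # img_emb: (B, 1, D) image embedding as the query;
        # txt_tokens: (B, T, D) text-token embeddings as keys/values.
        edited, _ = self.attn(img_emb, txt_tokens, txt_tokens)
        return self.proj(edited + img_emb)  # residual keeps the image content


class ControlAdapter(nn.Module):
    """Lightweight bottleneck adapter: the only module that is fine-tuned."""

    def __init__(self, dim: int = EMB_DIM, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # zero-init: training starts from the
        nn.init.zeros_(self.up.bias)    # frozen backbone's behavior

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        return cond + self.up(torch.relu(self.down(cond)))


# A single linear layer stands in for the frozen diffusion model's
# conditioning pathway so the sketch runs end to end.
backbone = nn.Linear(EMB_DIM, EMB_DIM).requires_grad_(False)
fusion, adapter = FusionEditor(), ControlAdapter()

img = torch.randn(4, 1, EMB_DIM)   # batch of reference-image embeddings
txt = torch.randn(4, 16, EMB_DIM)  # batch of modification-text tokens
cond = backbone(adapter(fusion(img, txt)))
print(cond.shape)  # torch.Size([4, 1, 768])
```

Zero-initializing the adapter's up-projection is a common trick in ControlNet-style modules so that fine-tuning starts from the frozen backbone's behavior; whether Fusion-Diff does the same is an assumption here.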
📝 Abstract
Composed Image Retrieval (CIR) enables fine-grained visual search by combining a reference image with a textual modification. While supervised CIR methods achieve high accuracy, their reliance on costly triplet annotations motivates zero-shot solutions. The core challenge in zero-shot CIR (ZS-CIR) is that existing text-centric and diffusion-based approaches struggle to bridge the vision-language modality gap effectively. To address this, we propose Fusion-Diff, a novel generative editing framework for multimodal alignment that combines high effectiveness with data efficiency. First, it introduces a multimodal fusion feature editing strategy within a joint vision-language (VL) space, substantially narrowing the modality gap. Second, to maximize data efficiency, the framework incorporates a lightweight Control-Adapter, achieving state-of-the-art performance after fine-tuning on a synthetic dataset of only 200K samples. Extensive experiments on standard CIR benchmarks (CIRR, FashionIQ, and CIRCO) demonstrate that Fusion-Diff significantly outperforms prior zero-shot approaches. We further enhance the interpretability of our model by visualizing the fused multimodal representations.
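For concreteness, the data-efficient fine-tuning regime the abstract describes (freeze the backbone, update only the adapter) follows the standard parameter-efficient pattern sketched below. Everything here (shapes, loss, optimizer, data) is a placeholder, not the paper's training recipe.

```python
import torch
import torch.nn as nn

# Stand-ins for the frozen diffusion backbone and the Control-Adapter
# (hypothetical shapes; any 768-dim conditioning pathway would do).
backbone = nn.Linear(768, 768)
adapter = nn.Sequential(nn.Linear(768, 64), nn.ReLU(), nn.Linear(64, 768))

for p in backbone.parameters():  # freeze everything except the adapter
    p.requires_grad = False
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

for step in range(3):  # placeholder loop over the synthetic triplets
    fused = torch.randn(8, 768)   # stands in for fused image+text features
    target = torch.randn(8, 768)  # stands in for the diffusion training target
    pred = backbone(fused + adapter(fused))
    loss = nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because gradients flow only through the small adapter, the trainable parameter count stays tiny, which is the usual reason such setups remain stable on limited synthetic data.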