Generative Editing in the Joint Vision-Language Space for Zero-Shot Composed Image Retrieval

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the large vision-language modality gap and the reliance on triplet annotations in zero-shot composed image retrieval (ZS-CIR), this paper proposes a generative cross-modal alignment framework based on diffusion models. The method, Fusion-Diff, jointly embeds the visual and linguistic modalities into a unified latent space and introduces a multimodal fusion feature editing strategy that enables fine-grained, cooperative editing of the reference image and its textual modification. It further adds a lightweight Control-Adapter module that requires only 200K synthetic samples for fine-tuning, dramatically improving data efficiency while reaching state-of-the-art performance. Extensive experiments show consistent gains over existing zero-shot methods on three benchmarks: CIRR, FashionIQ, and CIRCO, and ablation studies together with interpretability visualizations further validate the soundness and efficacy of the editing mechanism.
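The summary describes the architecture only at a high level; as a purely illustrative sketch (the module names, dimensions, and fusion rule below are assumptions, not taken from the paper), the idea of fusing reference-image and modification-text features in a joint vision-language space and feeding them to a small trainable adapter could be wired up in PyTorch roughly as follows:

```python
# Illustrative sketch only: module names, dimensions, and the fusion rule are
# assumptions for exposition; they are not the Fusion-Diff implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionEditor(nn.Module):
    """Fuses a reference-image embedding with a modification-text embedding,
    both assumed to live in the same joint vision-language space
    (e.g. produced by a CLIP-style encoder pair)."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        fused = self.fuse(torch.cat([img_emb, txt_emb], dim=-1))
        return F.normalize(fused, dim=-1)


class ControlAdapter(nn.Module):
    """A small trainable adapter that maps the fused embedding to a
    conditioning vector for a frozen generative (diffusion) backbone."""

    def __init__(self, dim: int = 768, cond_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(dim, cond_dim),
            nn.SiLU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, fused_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(fused_emb)


if __name__ == "__main__":
    img_emb = torch.randn(4, 768)   # stand-in for reference-image embeddings
    txt_emb = torch.randn(4, 768)   # stand-in for modification-text embeddings
    editor, adapter = FusionEditor(), ControlAdapter()
    cond = adapter(editor(img_emb, txt_emb))
    print(cond.shape)  # torch.Size([4, 1024])
```

Under this reading, only the adapter (and possibly the fusion head) would need training on the 200K synthetic samples while the generative backbone stays frozen, which would be consistent with the data-efficiency claim.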

📝 Abstract
Composed Image Retrieval (CIR) enables fine-grained visual search by combining a reference image with a textual modification. While supervised CIR methods achieve high accuracy, their reliance on costly triplet annotations motivates zero-shot solutions. The core challenge in zero-shot CIR (ZS-CIR) stems from a fundamental dilemma: existing text-centric or diffusion-based approaches struggle to effectively bridge the vision-language modality gap. To address this, we propose Fusion-Diff, a novel generative editing framework with high effectiveness and data efficiency designed for multimodal alignment. First, it introduces a multimodal fusion feature editing strategy within a joint vision-language (VL) space, substantially narrowing the modality gap. Second, to maximize data efficiency, the framework incorporates a lightweight Control-Adapter, enabling state-of-the-art performance through fine-tuning on only a limited-scale synthetic dataset of 200K samples. Extensive experiments on standard CIR benchmarks (CIRR, FashionIQ, and CIRCO) demonstrate that Fusion-Diff significantly outperforms prior zero-shot approaches. We further enhance the interpretability of our model by visualizing the fused multimodal representations.
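For context, zero-shot CIR results on CIRR, FashionIQ, and CIRCO are commonly reported as Recall@K over a cosine-similarity ranking of gallery images against the composed query embedding. The minimal sketch below shows that standard evaluation step; the shapes and variable names are illustrative assumptions, not code from the paper.

```python
# Minimal Recall@K evaluation sketch for composed image retrieval.
import torch
import torch.nn.functional as F


def recall_at_k(query_emb: torch.Tensor,    # (Q, D) composed-query embeddings
                gallery_emb: torch.Tensor,  # (G, D) candidate-image embeddings
                target_idx: torch.Tensor,   # (Q,) index of the ground-truth image
                k: int = 10) -> float:
    """Fraction of queries whose ground-truth image appears in the top-k
    cosine-similarity ranking of the gallery."""
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_emb, dim=-1)
    sims = q @ g.t()                         # (Q, G) cosine similarities
    topk = sims.topk(k, dim=-1).indices      # (Q, k) ranked gallery indices
    hits = (topk == target_idx.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()


if __name__ == "__main__":
    torch.manual_seed(0)
    queries, gallery = torch.randn(8, 768), torch.randn(100, 768)
    targets = torch.randint(0, 100, (8,))
    print(f"Recall@10 = {recall_at_k(queries, gallery, targets):.3f}")
```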
Problem

Research questions and friction points this paper is trying to address.

Bridging the vision-language modality gap in zero-shot composed image retrieval
Enabling fine-grained visual search from a reference image combined with a textual modification
Reaching high performance while relying only on a limited amount of synthetic training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative editing in joint vision-language space
Multimodal fusion feature editing strategy
Lightweight Control-Adapter for data efficiency
Xin Wang
Renmin University of China
Haipeng Zhang
Alibaba Group
Mang Li
Alibaba Group
Zhaohui Xia
Alibaba Group
Yueguo Chen
Renmin University of China
Yu Zhang
Alibaba Group
Chunyu Wei
Renmin University of China
Graph Machine Learning, Social Computing