🤖 AI Summary
To address the large vision-language modality gap and the reliance on triplet annotations in zero-shot composed image retrieval (ZS-CIR), this paper proposes a generative cross-modal alignment framework based on diffusion models. The method jointly embeds the visual and linguistic modalities into a unified latent space and introduces a multimodal fusion feature editing strategy that enables fine-grained, cooperative editing of reference images and textual modifications. It further adds a lightweight Control-Adapter module that requires only 200K synthetic samples for fine-tuning, dramatically improving data efficiency while achieving state-of-the-art performance. Extensive experiments demonstrate consistent gains over existing zero-shot methods on three benchmarks: CIRR, FashionIQ, and CIRCO. Ablation studies and interpretability visualizations further confirm the soundness and effectiveness of the editing mechanism.
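The paper's actual architecture is not reproduced here, but the two ideas in the summary (fusing reference-image and modification-text features in a joint latent space, then conditioning a frozen diffusion backbone through a small adapter) can be sketched in PyTorch. This is a minimal illustrative sketch, not the authors' code: the module names (`FusionEditor`, `ControlAdapter`), the cross-attention fusion design, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

EMB_DIM = 768  # assumed width of a CLIP-like joint embedding space


class FusionEditor(nn.Module):
    """Fuses a reference-image embedding with modification-text tokens."""

    def __init__(self, dim: int = EMB_DIM, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, img_emb: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # img_emb: (B, 1, D) image embedding as the query;
        # txt_tokens: (B, T, D) text-token embeddings as keys/values.
        edited, _ = self.attn(img_emb, txt_tokens, txt_tokens)
        return self.proj(edited + img_emb)  # residual keeps the image content


class ControlAdapter(nn.Module):
    """Lightweight bottleneck adapter: the only module that is fine-tuned."""

    def __init__(self, dim: int = EMB_DIM, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # zero-init: training starts from the
        nn.init.zeros_(self.up.bias)    # frozen backbone's behavior

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        return cond + self.up(torch.relu(self.down(cond)))


# A single linear layer stands in for the frozen diffusion model's
# conditioning pathway so the sketch runs end to end.
backbone = nn.Linear(EMB_DIM, EMB_DIM).requires_grad_(False)
fusion, adapter = FusionEditor(), ControlAdapter()

img = torch.randn(4, 1, EMB_DIM)   # batch of reference-image embeddings
txt = torch.randn(4, 16, EMB_DIM)  # batch of modification-text tokens
cond = backbone(adapter(fusion(img, txt)))
print(cond.shape)  # torch.Size([4, 1, 768])
```

Zero-initializing the adapter's up-projection is a common trick in ControlNet-style modules so that fine-tuning starts from the frozen backbone's behavior; whether Fusion-Diff does the same is an assumption here.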
📝 Abstract
Composed Image Retrieval (CIR) enables fine-grained visual search by combining a reference image with a textual modification. While supervised CIR methods achieve high accuracy, their reliance on costly triplet annotations motivates zero-shot solutions. The core challenge in zero-shot CIR (ZS-CIR) is that existing text-centric and diffusion-based approaches struggle to bridge the vision-language modality gap effectively. To address this, we propose Fusion-Diff, a novel generative editing framework for multimodal alignment that combines high effectiveness with data efficiency. First, it introduces a multimodal fusion feature editing strategy within a joint vision-language (VL) space, substantially narrowing the modality gap. Second, to maximize data efficiency, the framework incorporates a lightweight Control-Adapter, achieving state-of-the-art performance after fine-tuning on a synthetic dataset of only 200K samples. Extensive experiments on standard CIR benchmarks (CIRR, FashionIQ, and CIRCO) demonstrate that Fusion-Diff significantly outperforms prior zero-shot approaches. We further enhance the interpretability of our model by visualizing the fused multimodal representations.
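For concreteness, the data-efficient fine-tuning regime the abstract describes (freeze the backbone, update only the adapter) follows the standard parameter-efficient pattern sketched below. Everything here (shapes, loss, optimizer, data) is a placeholder, not the paper's training recipe.

```python
import torch
import torch.nn as nn

# Stand-ins for the frozen diffusion backbone and the Control-Adapter
# (hypothetical shapes; any 768-dim conditioning pathway would do).
backbone = nn.Linear(768, 768)
adapter = nn.Sequential(nn.Linear(768, 64), nn.ReLU(), nn.Linear(64, 768))

for p in backbone.parameters():  # freeze everything except the adapter
    p.requires_grad = False
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

for step in range(3):  # placeholder loop over the synthetic triplets
    fused = torch.randn(8, 768)   # stands in for fused image+text features
    target = torch.randn(8, 768)  # stands in for the diffusion training target
    pred = backbone(fused + adapter(fused))
    loss = nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because gradients flow only through the small adapter, the trainable parameter count stays tiny, which is the usual reason such setups remain stable on limited synthetic data.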