🤖 AI Summary
Existing composed image retrieval (CIR) methods suffer from imbalanced fusion of multimodal intent: they either over-rely on the reference image or overly prioritize the textual modification, leading to inaccurate modeling of fine-grained user intents (e.g., "more formal" or "add lace trim").
Method: We propose TMCIR, an Intent-Aware Cross-Modal Alignment and Adaptive Token Fusion framework. First, diffusion-based pseudo-target image generation refines CLIP's visual representations to enhance cross-modal semantic alignment. Second, adaptive token-level weighting in contrastive learning dynamically balances the contributions of textual and visual tokens, precisely capturing the joint intent encoded by "reference image + text modification." The method builds on CLIP, integrating pseudo-sample generation, contrastive fine-tuning, and token-wise adaptive fusion.
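To make the token-wise adaptive fusion step concrete, here is a minimal NumPy sketch of the general idea: image and text token embeddings are scored by a learned head (stood in for here by a single hypothetical scoring vector `w_score`), normalized with a softmax, and combined into one composed retrieval feature. This is an illustration of the mechanism described above, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_token_fusion(img_tokens, txt_tokens, w_score):
    """Fuse image and text token embeddings with adaptive per-token weights.

    img_tokens: (n_img, d) image token embeddings
    txt_tokens: (n_txt, d) text token embeddings
    w_score:    (d,)       scoring vector (stand-in for a learned weighting head)
    Returns a single L2-normalized composed feature of shape (d,).
    """
    tokens = np.concatenate([img_tokens, txt_tokens], axis=0)  # (n_img+n_txt, d)
    scores = tokens @ w_score          # one relevance scalar per token
    weights = softmax(scores)          # weights sum to 1 across all tokens
    fused = weights @ tokens           # weighted sum of token embeddings
    return fused / np.linalg.norm(fused)  # unit norm for cosine retrieval

# Toy usage with random embeddings
rng = np.random.default_rng(0)
fused = adaptive_token_fusion(rng.normal(size=(4, 8)),
                              rng.normal(size=(6, 8)),
                              rng.normal(size=8))
```

In the real model the scoring head would be trained end-to-end inside the contrastive pipeline, so the weights shift toward whichever modality carries the user's modification intent for a given query.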
Results: Our approach achieves state-of-the-art performance on Fashion-IQ and CIRR, with particularly notable gains in retrieving images matching subtle semantic intents.
📝 Abstract
Composed Image Retrieval (CIR) retrieves target images using a multi-modal query that combines a reference image with text describing desired modifications. The primary challenge is effectively fusing this visual and textual information. Current cross-modal feature fusion approaches for CIR exhibit an inherent bias in intent interpretation: they tend to disproportionately emphasize either the reference image features (visual-dominant fusion) or the textual modification intent (text-dominant fusion through image-to-text conversion). Such imbalanced representations often fail to capture the user's actual search intent in the retrieval results. To address this challenge, we propose TMCIR, a novel framework that advances composed image retrieval through two key innovations: 1) Intent-Aware Cross-Modal Alignment. We first fine-tune CLIP encoders contrastively using intent-reflecting pseudo-target images, synthesized from reference images and textual descriptions via a diffusion model. This step enhances the text encoder's ability to capture nuanced intents in textual descriptions. 2) Adaptive Token Fusion. We further fine-tune all encoders contrastively by comparing adaptive token-fusion features with the target image. This mechanism dynamically balances visual and textual representations within the contrastive learning pipeline, optimizing the composed feature for retrieval. Extensive experiments on the Fashion-IQ and CIRR datasets demonstrate that TMCIR significantly outperforms state-of-the-art methods, particularly in capturing nuanced user intent.
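Both fine-tuning stages described above rest on a batch contrastive objective, where each composed query feature should match its (pseudo-)target image feature and all other in-batch targets act as negatives. The following is a minimal NumPy sketch of such a symmetric-style InfoNCE loss (query-to-target direction only); the temperature value and the plain dot-product similarity are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def info_nce(query_feats, target_feats, temperature=0.07):
    """In-batch contrastive loss for composed-query / target-image pairs.

    query_feats, target_feats: (B, d) L2-normalized features.
    Diagonal pairs are positives; every other row in the batch is a negative.
    """
    logits = query_feats @ target_feats.T / temperature  # (B, B) cosine similarities
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    idx = np.arange(len(logits))
    return -log_probs[idx, idx].mean()  # cross-entropy on the diagonal positives

# Toy usage: perfectly aligned pairs give a much lower loss than mismatched ones
rng = np.random.default_rng(1)
q = rng.normal(size=(4, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)
loss_aligned = info_nce(q, q)
loss_shuffled = info_nce(q, np.roll(q, 1, axis=0))
```

Minimizing this loss pulls each composed feature toward its target image in the shared embedding space, which is what both the pseudo-target alignment stage and the adaptive-fusion stage optimize.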