FBCIR: Balancing Cross-Modal Focuses in Composed Image Retrieval

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the significant performance degradation of current composed image retrieval (CIR) models when confronted with semantically aligned hard negative samples, a problem rooted in imbalanced cross-modal attention. The study is the first to explicitly identify and quantify this issue, proposing FBCIR—a novel approach that leverages multimodal interpretability analysis to diagnose modality-specific attention biases between vision and text. Guided by these insights, FBCIR introduces a tailored data augmentation strategy that incorporates hard negatives to encourage balanced cross-modal reasoning. Extensive experiments demonstrate that FBCIR consistently enhances retrieval performance under challenging conditions across multiple state-of-the-art CIR models, while preserving their original capabilities on standard benchmarks, thereby validating the effectiveness and generalizability of the proposed paradigm.

📝 Abstract
Composed image retrieval (CIR) requires multi-modal models to jointly reason over visual content and semantic modifications presented in text-image input pairs. While current CIR models achieve strong performance on common benchmark cases, their accuracy often degrades in more challenging scenarios where negative candidates are semantically aligned with the query image or text. In this paper, we attribute this degradation to focus imbalances, where models disproportionately attend to one modality while neglecting the other. To validate this claim, we propose FBCIR, a multi-modal focus interpretation method that identifies the visual and textual input components most crucial to a model's retrieval decisions. Using FBCIR, we report that focus imbalances are prevalent in existing CIR models, especially under hard negative settings. Building on these analyses, we further propose a CIR data augmentation workflow that augments existing CIR datasets with curated hard negatives designed to encourage balanced cross-modal reasoning. Extensive experiments across multiple CIR models demonstrate that the proposed augmentation consistently improves performance in challenging cases, while maintaining their capabilities on standard benchmarks. Together, our interpretation method and data augmentation workflow provide a new perspective on CIR model diagnosis and robustness improvement.
Problem

Research questions and friction points this paper is trying to address.

composed image retrieval
focus imbalance
cross-modal reasoning
hard negatives
multi-modal models
Innovation

Methods, ideas, or system contributions that make the work stand out.

composed image retrieval
cross-modal focus balance
hard negative mining
multimodal interpretation
data augmentation
Chenchen Zhao
Department of Computer Science and Engineering, The Chinese University of Hong Kong
Jianhuan Zhuo
Institute of Information Engineering, Chinese Academy of Sciences
Representation Learning, Recommendation System
Muxi Chen
Department of Computer Science and Engineering, The Chinese University of Hong Kong
Zhaohua Zhang
School of Mathematical Sciences, Dalian University of Technology
Wenyu Jiang
Nanjing University
AI, LLMs, VLMs
Tianwen Jiang
Harbin Institute of Technology
Knowledge Graph, Information Extraction, Natural Language Processing
Qiuyong Xiao
Tencent AI Data Department
Jihong Zhang
Tencent AI Data Department
Qiang Xu
Professor, The Chinese University of Hong Kong
EDA, Time Series, AI Safety