🤖 AI Summary
This work addresses two limitations of projection-based methods for zero-shot composed image retrieval (ZS-CIR): endpoint-level matching lets the modification text act as a target-side attribute cue, and training endpoint alignment and semantic transition alignment in one shared text adapter puts the two objectives in conflict. The proposed method, DeCIR, decouples these objectives by training two separate low-rank text adapter branches and merging them into a single deployable adapter via Low-Rank Directional Merge (LRDM). It constructs paired forward/reverse edit tuples from image-caption pairs as supervision for semantic transition learning. On CIRR, CIRCO, FashionIQ, and GeneCIS, DeCIR consistently improves projection-based ZS-CIR without increasing inference complexity.
📝 Abstract
Zero-shot composed image retrieval (ZS-CIR) retrieves a target image from a reference image and a text modification without human-annotated CIR triplets. Projection-based ZS-CIR methods are attractive because they do not rely on LLMs at inference and remain lightweight, but they often underperform LLM-based approaches on complex semantic modifications. This gap reflects a semantic transition bottleneck in projection-based ZS-CIR: endpoint-level matching can let the edit text act as a target-side attribute cue rather than grounding it as a source-conditioned semantic transition. We further show that adding semantic transition supervision to the same text adapter creates an endpoint-transition conflict between endpoint alignment and semantic transition alignment. To address this conflict, DeCIR decouples endpoint and transition learning. It constructs paired forward/reverse edit tuples from image-caption pairs, trains separate low-rank text adapter branches for endpoint alignment and semantic transition alignment, and merges them with Low-Rank Directional Merge (LRDM) into one deployable adapter. Extensive experiments on CIRR, CIRCO, FashionIQ, and GeneCIS demonstrate that DeCIR consistently improves projection-based ZS-CIR without increasing inference complexity.
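To illustrate why merging two low-rank branches can leave inference cost unchanged, the sketch below combines two adapter updates into a single low-rank update on one weight matrix. It is not LRDM, whose details the abstract does not give: a plain weighted sum followed by SVD truncation stands in for the merge rule. All names, shapes, and weights (`merge_low_rank`, `w_e`, `w_t`, the toy dimensions) are assumptions made for illustration.

```python
import numpy as np

def merge_low_rank(B_e, A_e, B_t, A_t, w_e=0.5, w_t=0.5, rank=None):
    """Merge two low-rank updates (B_e @ A_e and B_t @ A_t) into one factor pair.

    Stand-in merge rule (assumption, NOT the paper's LRDM): weighted sum of the
    two updates, then truncation to `rank` via SVD.
    """
    delta = w_e * (B_e @ A_e) + w_t * (B_t @ A_t)    # dense combined update
    if rank is None:
        rank = A_e.shape[0]                          # default: rank of one branch
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    B_m = U[:, :rank] * S[:rank]                     # scale columns by singular values
    A_m = Vt[:rank, :]
    return B_m, A_m

# Toy dimensions (hypothetical): a 16x12 text-encoder weight, rank-2 branches.
rng = np.random.default_rng(0)
d_out, d_in, r = 16, 12, 2
W = rng.normal(size=(d_out, d_in))                   # frozen base weight
B_e, A_e = rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))  # endpoint branch
B_t, A_t = rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))  # transition branch

# Rank 2r is enough to represent the sum of two rank-r updates exactly.
B_m, A_m = merge_low_rank(B_e, A_e, B_t, A_t, rank=2 * r)
target = 0.5 * (B_e @ A_e) + 0.5 * (B_t @ A_t)
merge_error = np.linalg.norm(B_m @ A_m - target)

# The deployed weight has the same shape as W, so inference cost is unchanged.
W_deployed = W + B_m @ A_m
```

Because the merged update folds into a weight of the original shape, the deployed model runs one adapter rather than two. Truncating to a rank below `2 * r` would trade some reconstruction error for a smaller adapter.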