Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
In zero-shot composed image retrieval (ZS-CIR), retrieval becomes inaccurate when the reference image lacks key visual content of the target. To address this, the authors propose PrediCIR, a predictive mapping network that introduces a world-model-guided content prediction mechanism in latent space. It infers and synthesizes pseudo-token representations for the missing content solely from the user's manipulation text, requiring no extra supervision or explicit image editing. By performing text-guided, adaptive mapping from the source view to the target view in latent space, PrediCIR achieves fine-grained semantic alignment. Evaluated on six ZS-CIR benchmarks, it consistently outperforms state-of-the-art methods, with retrieval accuracy gains ranging from 1.73% to 4.45%. The implementation is publicly available.
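To make the summary above concrete, the following is a minimal PyTorch sketch of the core idea, not the released implementation: a predictor conditioned on the manipulation text estimates the missing target content in latent space, and the completed representation is mapped to a single pseudo-word token. The class name, embedding dimensions, and the simple MLP predictor are illustrative assumptions; the actual architecture is in the linked repository.

```python
import torch
import torch.nn as nn

class LatentContentPredictor(nn.Module):
    """Sketch of world-model-style prediction of missing target content.

    Given a reference-image embedding and an embedding of the manipulation
    text, predict the target-relevant content the reference image lacks,
    then fuse it into a single pseudo-word token in the text-embedding space.
    Dimensions and layer choices are illustrative assumptions.
    """

    def __init__(self, img_dim: int = 768, txt_dim: int = 512):
        super().__init__()
        # Predictor conditioned on the manipulation text (the "action").
        self.predictor = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, img_dim),
        )
        # Maps the completed image representation to a pseudo-word token.
        self.to_pseudo_token = nn.Linear(img_dim, txt_dim)

    def forward(self, img_emb: torch.Tensor, action_emb: torch.Tensor) -> torch.Tensor:
        # Predict the missing visual content in latent space.
        missing = self.predictor(torch.cat([img_emb, action_emb], dim=-1))
        # Complete the reference representation and map it to one pseudo-token.
        completed = img_emb + missing
        return self.to_pseudo_token(completed)


if __name__ == "__main__":
    model = LatentContentPredictor()
    img_emb = torch.randn(4, 768)      # reference-image embeddings (e.g. from a frozen vision encoder)
    action_emb = torch.randn(4, 512)   # manipulation-text embeddings (e.g. from a frozen text encoder)
    pseudo_tokens = model(img_emb, action_emb)
    print(pseudo_tokens.shape)         # torch.Size([4, 512])
```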

📝 Abstract
Zero-Shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with a broad range of visual content manipulation intents across domain, scene, object, and attribute. The key challenge for ZS-CIR tasks is to modify a reference image according to manipulation text to accurately retrieve a target image, especially when the reference image is missing essential target content. In this paper, we propose a novel prediction-based mapping network, named PrediCIR, to adaptively predict the missing target visual content in reference images in the latent space before mapping, for accurate ZS-CIR. Specifically, a world view generation module first constructs a source view by omitting certain visual content of a target view, coupled with an action that encodes the manipulation intent derived from existing image-caption pairs. Then, a target content prediction module trains a world model as a predictor to adaptively predict the missing visual information in the latent space, guided by the user's intent in the manipulation text. The two modules map an image with the predicted relevant information to a pseudo-word token without extra supervision. Our model shows strong generalization ability on six ZS-CIR tasks. It obtains consistent and significant performance boosts ranging from 1.73% to 4.45% over the best methods and achieves new state-of-the-art results on ZS-CIR. Our code is available at https://github.com/Pter61/predicir.
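As a hedged sketch of how the two modules in the abstract could be trained self-supervised from image-caption pairs: a source view is formed by dropping some patch embeddings of the target view, the caption stands in for the manipulation "action", and the world-model predictor is trained to recover the pooled embedding of the omitted content. The random patch dropping, mean pooling, and MSE loss are assumptions for illustration; the paper's actual world view generation and target content prediction modules may differ.

```python
import torch
import torch.nn.functional as F

def make_world_view(patch_emb: torch.Tensor, keep_ratio: float = 0.75):
    """Build a (source view, omitted content) pair from target-view patch embeddings.

    patch_emb: (num_patches, dim) embeddings of the target image.
    Returns the kept patches (source view) and the pooled embedding of the
    omitted patches (the content the predictor must recover). Random patch
    dropping is an illustrative stand-in for the paper's view generation.
    """
    num_patches = patch_emb.size(0)
    num_keep = max(1, int(keep_ratio * num_patches))
    perm = torch.randperm(num_patches)
    keep_idx, drop_idx = perm[:num_keep], perm[num_keep:]
    source_view = patch_emb[keep_idx]
    omitted_target = patch_emb[drop_idx].mean(dim=0)  # pooled missing content
    return source_view, omitted_target


def prediction_loss(predictor, source_view, action_emb, omitted_target):
    """Train the world model to predict the omitted content from source view + action."""
    pooled_source = source_view.mean(dim=0)
    pred = predictor(torch.cat([pooled_source, action_emb], dim=-1))
    return F.mse_loss(pred, omitted_target)


if __name__ == "__main__":
    import torch.nn as nn
    dim, txt_dim = 768, 512
    predictor = nn.Sequential(nn.Linear(dim + txt_dim, 1024), nn.GELU(), nn.Linear(1024, dim))
    patch_emb = torch.randn(196, dim)   # e.g. ViT patch embeddings of the target image
    action_emb = torch.randn(txt_dim)   # caption-derived manipulation intent
    source_view, omitted = make_world_view(patch_emb)
    loss = prediction_loss(predictor, source_view, action_emb, omitted)
    loss.backward()
    print(float(loss))
```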
Problem

Research questions and friction points this paper is trying to address.

Predict missing target content in reference images for retrieval
Adaptively predict visual information using manipulation intent
Improve zero-shot composed image retrieval accuracy across domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predicts missing visual content adaptively
Uses world model for latent space prediction
Maps images to pseudo-word tokens without extra supervision (see the sketch after this list)
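The sketch below shows the standard ZS-CIR retrieval pattern that such pseudo-word tokens feed into (the Pic2Word-style recipe), rather than PrediCIR's exact prompt or encoder: the pseudo-token is spliced into a templated prompt alongside the manipulation text, the composed query is encoded, and candidates are ranked by cosine similarity. The placeholder position and the mean-pooling stand-in for a frozen text encoder are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def compose_query(prompt_tok_emb: torch.Tensor, pseudo_token: torch.Tensor,
                  placeholder_pos: int) -> torch.Tensor:
    """Splice the image-derived pseudo-word token into the prompt's token embeddings.

    prompt_tok_emb: (seq_len, txt_dim) embeddings of e.g. "a photo of * that <manipulation text>"
    pseudo_token:   (txt_dim,) output of the mapping network
    placeholder_pos: index of the '*' placeholder token to be replaced
    """
    composed = prompt_tok_emb.clone()
    composed[placeholder_pos] = pseudo_token
    return composed


def rank_gallery(query_emb: torch.Tensor, gallery_emb: torch.Tensor) -> torch.Tensor:
    """Rank candidate target images by cosine similarity to the composed query."""
    query_emb = F.normalize(query_emb, dim=-1)
    gallery_emb = F.normalize(gallery_emb, dim=-1)
    return (gallery_emb @ query_emb).argsort(descending=True)


if __name__ == "__main__":
    txt_dim = 512
    prompt_tok_emb = torch.randn(8, txt_dim)   # token embeddings of the templated prompt
    pseudo_token = torch.randn(txt_dim)        # from the mapping network (see sketch above)
    composed = compose_query(prompt_tok_emb, pseudo_token, placeholder_pos=3)
    # In practice `composed` would pass through a frozen text encoder; here we simply mean-pool it.
    query_emb = composed.mean(dim=0)
    gallery_emb = torch.randn(100, txt_dim)    # candidate target-image embeddings
    print(rank_gallery(query_emb, gallery_emb)[:5])
```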
Yuanmin Tang
University of Chinese Academy of Sciences
Machine learning
Jing Yu
Northwestern University
Sustainability, Life Cycle Analysis, Transportation Management, Operations Research
Keke Gai
Beijing Institute of Technology
Cyber Security, Blockchain, AI Security, Privacy-preserving Computation, FinTech
Jiamin Zhuang
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Gang Xiong
Institute of Information Engineering, Chinese Academy of Sciences
Gaopeng Gou
Institute of Information Engineering, Chinese Academy of Sciences
Qi Wu
University of Adelaide