OSCAR: Optimization-Steered Agentic Planning for Composed Image Retrieval

📅 2026-02-09
🤖 AI Summary
This work addresses the complex reasoning challenges that heterogeneous visual-textual constraints pose for composed image retrieval by proposing OSCAR, a framework that formulates the retrieval task as a trajectory optimization problem. OSCAR adopts an offline-online two-stage paradigm. In the offline phase, it uses mixed-integer programming and Boolean set operations to derive optimal retrieval trajectories and stores them in a gold-standard trajectory library. During online inference, entries from this library serve as in-context demonstrations that guide a vision-language model in planning. This design avoids both the limitations of unified embedding models and the suboptimal trial-and-error behavior of heuristic agents. Experiments show that OSCAR outperforms state-of-the-art methods on three public benchmarks and an industrial dataset, and that it does so with only 10% of the training data.
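The online step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the summary does not say how demonstrations are chosen from the library, so the word-overlap (Jaccard) ranking, the `build_prompt` helper, the library entries, and the plan notation are all assumptions made for this example.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two strings (assumed ranking metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def build_prompt(query: str, library: list[dict], k: int = 2) -> str:
    """Pick the k most similar stored trajectories and format them as demonstrations."""
    ranked = sorted(library, key=lambda e: jaccard(query, e["query"]), reverse=True)
    blocks = [f"Query: {e['query']}\nPlan: {e['plan']}" for e in ranked[:k]]
    # The new query goes last, with an empty plan for the VLM planner to complete.
    blocks.append(f"Query: {query}\nPlan:")
    return "\n\n".join(blocks)


# Hypothetical library entries; the plan strings are illustrative notation only.
library = [
    {"query": "same dress but in red",
     "plan": "image_similarity AND text_search('red')"},
    {"query": "remove the dog from the scene",
     "plan": "image_similarity NOT text_search('dog')"},
    {"query": "a blue car at night",
     "plan": "text_search('blue car') AND text_search('night')"},
]

prompt = build_prompt("same shirt but in red", library, k=1)
print(prompt)
```

The resulting prompt would then be passed to the vision-language model, which is expected to complete the final `Plan:` line.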

📝 Abstract
Composed image retrieval (CIR) requires complex reasoning over heterogeneous visual and textual constraints. Existing approaches largely fall into two paradigms: unified embedding retrieval, which suffers from single-model myopia, and heuristic agentic retrieval, which is limited by suboptimal, trial-and-error orchestration. To address these limitations, we propose OSCAR, an optimization-steered agentic planning framework for composed image retrieval. We are the first to reformulate agentic CIR from a heuristic search process into a principled trajectory optimization problem. Instead of relying on heuristic trial-and-error exploration, OSCAR employs a novel offline-online paradigm. In the offline phase, we model CIR, via atomic retrieval selection and composition, as a two-stage mixed-integer programming problem and mathematically derive optimal trajectories that maximize ground-truth coverage for training samples through rigorous Boolean set operations. These trajectories are then stored in a golden library and serve as in-context demonstrations that steer the VLM planner at online inference time. Extensive experiments on three public benchmarks and a private industrial benchmark show that OSCAR consistently outperforms SOTA baselines. Notably, it achieves superior performance using only 10% of the training data, demonstrating strong generalization of planning logic rather than dataset-specific memorization.
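To make the offline phase more concrete, the toy sketch below selects atomic retrievers and composes their result sets with Boolean set operations to cover a known ground truth. It is not the paper's formulation: exhaustive enumeration stands in for the mixed-integer program, trajectories are limited to at most two retrievers, and the retriever names, image IDs, and tie-breaking rule (prefer smaller result sets, then shorter plans) are all assumptions for illustration.

```python
from itertools import permutations

# Hypothetical atomic retrievers, each returning a set of candidate image IDs.
atomic_results = {
    "text_search": {1, 2, 3, 4},
    "image_similarity": {3, 4, 5, 6},
    "attribute_filter": {4, 6, 7},
}
ground_truth = {4, 6}  # target images for one training sample

# Boolean set operations used to compose two retrieval results.
OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NOT": lambda a, b: a - b,
}


def best_trajectory(results: dict, truth: set):
    """Enumerate one- and two-retriever plans and return the best (plan, result)."""
    candidates = [((name,), found) for name, found in results.items()]
    for a, b in permutations(results, 2):
        for op_name, op in OPS.items():
            candidates.append(((a, op_name, b), op(results[a], results[b])))

    def objective(item):
        plan, found = item
        # Maximize ground-truth coverage first; the two tie-breakers are
        # assumptions added so the toy problem has a well-defined optimum.
        return (len(found & truth), -len(found), -len(plan))

    return max(candidates, key=objective)


plan, result = best_trajectory(atomic_results, ground_truth)
print(plan, result)
```

In a setup like this, the winning plan for each training sample is what would be stored in the golden library. A real solver becomes necessary once the number of retrievers and the depth of composition make enumeration impractical.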
Problem

Research questions and friction points this paper is trying to address.

Composed Image Retrieval
Heterogeneous Constraints
Agentic Planning
Retrieval Optimization
Visual-Textual Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

composed image retrieval
trajectory optimization
mixed-integer programming
agentic planning
vision-language model
Teng Wang
AI Researcher @ OPPO Research Institute
AI, LLM Reasoning, NLP
Rong Shan
Shanghai Jiao Tong University, Shanghai, China
Jianghao Lin
Shanghai Jiao Tong University
Large Language Models, AI Agents, Recommender Systems
Junjie Wu
Center for High Pressure Science & Technology Advanced Research
Physics
Tianyi Xu
Tulane University
Reinforcement Learning, Network Optimization, Statistics, NLP (LLM), Operations Research
Jianping Zhang
Shanghai Jiao Tong University, Shanghai, China
Wenteng Chen
Shanghai Jiao Tong University, Shanghai, China
Changwang Zhang
OPPO, Shenzhen, China
Zhaoxiang Wang
OPPO, Shenzhen, China
Weinan Zhang
Professor, Shanghai Jiao Tong University
Reinforcement Learning, Agents, Data Science
Jun Wang
OPPO, Shenzhen, China