Pix2Key: Controllable Open-Vocabulary Retrieval with Semantic Decomposition and Self-Supervised Visual Dictionary Learning

📅 2026-02-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three limitations of existing Composed Image Retrieval (CIR) methods: classic fusion pipelines rely on supervised triplets and can lose fine-grained cues, while zero-shot approaches that merge a caption of the reference image with the edit text may miss implicit user intent and return repetitive results. To overcome these issues, we propose Pix2Key, a unified framework that represents both queries and candidates as open-vocabulary visual dictionaries built through semantic decomposition, enabling intent-aware constraint matching and diversity-aware reranking within a shared embedding space. Additionally, we introduce V-Dict-AE, a self-supervised pretraining component that uses only images and no CIR-specific supervision to strengthen fine-grained attribute understanding. On the DFMM-Compose benchmark, Pix2Key improves Recall@10 by up to 3.2 points, and adding V-Dict-AE yields a further 2.3-point gain while improving intent consistency and maintaining high list diversity.
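For readers unfamiliar with the metric behind the reported gains, the following is a minimal sketch of how Recall@K is commonly computed in retrieval evaluation. It assumes one ground-truth target per query and reports a percentage, so a "3.2-point" gain corresponds to a difference of 3.2 between two such scores. The function name `recall_at_k` and the toy data are illustrative assumptions; the exact DFMM-Compose evaluation protocol is not described in this summary.

```python
def recall_at_k(rankings, targets, k=10):
    """Percentage of queries whose target appears in the top-k results.

    rankings: one ranked list of candidate ids per query (best first).
    targets:  one ground-truth candidate id per query
              (assumption: a single target per query).
    """
    if len(rankings) != len(targets):
        raise ValueError("rankings and targets must have the same length")
    if not targets:
        raise ValueError("at least one query is required")
    hits = sum(1 for ranked, target in zip(rankings, targets)
               if target in ranked[:k])
    return 100.0 * hits / len(targets)


# Toy example with two queries and hypothetical candidate ids.
rankings = [["a", "b", "c"], ["d", "e", "f"]]
targets = ["b", "f"]
print(recall_at_k(rankings, targets, k=2))
print(recall_at_k(rankings, targets, k=3))
```

In the toy example only the first query finds its target within the top 2, while both queries find it within the top 3.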

📝 Abstract

Composed Image Retrieval (CIR) uses a reference image plus a natural-language edit to retrieve images that apply the requested change while preserving other relevant visual content. Classic fusion pipelines typically rely on supervised triplets and can lose fine-grained cues, while recent zero-shot approaches often caption the reference image and merge the caption with the edit, which may miss implicit user intent and return repetitive results. We present Pix2Key, which represents both queries and candidates as open-vocabulary visual dictionaries, enabling intent-aware constraint matching and diversity-aware reranking in a unified embedding space. A self-supervised pretraining component, V-Dict-AE, further improves the dictionary representation using only images, strengthening fine-grained attribute understanding without CIR-specific supervision. On the DFMM-Compose benchmark, Pix2Key improves Recall@10 by up to 3.2 points, and adding V-Dict-AE yields an additional 2.3-point gain while improving intent consistency and maintaining high list diversity.
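To make the idea of diversity-aware reranking concrete, here is a minimal sketch using maximal marginal relevance (MMR), a standard generic technique that trades query relevance against redundancy with already selected items. The abstract does not specify Pix2Key's actual reranking procedure, so MMR is a stand-in chosen for illustration; the names `cosine`, `mmr_rerank`, and `trade_off`, and the toy embeddings, are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_rerank(query, candidates, top_k=10, trade_off=0.7):
    """Greedy MMR: return candidate indices in selection order.

    trade_off=1.0 ranks by relevance only; lower values penalize
    candidates that resemble items already selected.
    """
    relevance = [cosine(query, c) for c in candidates]
    selected = []
    remaining = list(range(len(candidates)))

    while remaining and len(selected) < top_k:
        def score(i):
            # Redundancy is the highest similarity to any selected item.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return trade_off * relevance[i] - (1 - trade_off) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings: candidates 0 and 1 are near-duplicates, 2 is distinct.
query = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.11], [0.8, -0.6]]
print(mmr_rerank(query, candidates, top_k=2, trade_off=1.0))
print(mmr_rerank(query, candidates, top_k=2, trade_off=0.5))
```

With `trade_off=1.0` the two near-duplicate candidates fill the list, whereas `trade_off=0.5` replaces the second one with the distinct candidate, which is the kind of reduction in repetitive results the abstract describes.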

Problem

Research questions and friction points this paper is trying to address.

Composed Image Retrieval

open-vocabulary

visual dictionary

intent awareness

retrieval diversity

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual dictionary learning

self-supervised pretraining

open-vocabulary retrieval