IntRec: Intent-based Retrieval with Contrastive Refinement

📅 2026-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of object retrieval in complex scenes, where user queries often suffer from semantic ambiguity or high similarity among multiple targets, and existing open-vocabulary detectors lack the ability to refine results through user feedback. To this end, we propose IntRec, an interactive object retrieval framework that models user intent states, maintains positive and negative memory sets, and employs a contrastive alignment function to dynamically refine candidate rankings for fine-grained disambiguation. Notably, IntRec requires no additional supervision and achieves efficient, low-latency interaction (<30 ms per round). On the LVIS benchmark, it attains 35.4 AP, outperforming OVMR, CoDet, and CAKE. Moreover, on the LVIS-Ambiguous benchmark, a single round of feedback yields a 7.9 AP improvement, substantially enhancing retrieval accuracy in ambiguous scenarios.

📝 Abstract
Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner, lacking the ability to refine predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that iteratively refines its predictions across rounds of feedback. At its core is an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.
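The contrastive alignment idea described in the abstract can be sketched in a few lines: score each candidate by its similarity to the confirmed positive cues, subtract a penalty for similarity to rejected hypotheses, and re-rank. This is a minimal illustration, not the paper's actual implementation; the function name `contrastive_rank`, the `penalty` weight, and the use of a max over cosine similarities are all assumptions for the sketch.

```python
import numpy as np

def contrastive_rank(candidates, positives, negatives, penalty=1.0):
    """Rank candidate embeddings best-first.

    Each candidate's score is its highest cosine similarity to any
    positive anchor, minus `penalty` times its highest cosine
    similarity to any rejected (negative) embedding.
    (Hypothetical sketch of the contrastive alignment function.)
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scores = []
    for c in candidates:
        pos = max((cos(c, p) for p in positives), default=0.0)
        neg = max((cos(c, n) for n in negatives), default=0.0)
        scores.append(pos - penalty * neg)
    # Indices of candidates, sorted from highest to lowest score.
    return sorted(range(len(candidates)), key=lambda i: -scores[i])
```

Under this sketch, a round of feedback simply appends the confirmed object's embedding to `positives` or the rejected one's to `negatives` and re-runs the ranking, which is consistent with the sub-30 ms per-round latency the paper reports, since no model weights are updated.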
Problem

Research questions and friction points this paper is trying to address.

object retrieval
ambiguous queries
complex scenes
open-vocabulary detection
user feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

interactive object retrieval
intent state
contrastive alignment
open-vocabulary detection
user feedback refinement