Teaching VLMs to Localize Specific Objects from In-context Examples

📅 2024-11-20
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses a core limitation of vision-language models (VLMs): their inability to precisely localize specific objects in few-shot settings relying solely on visual context—especially when textual descriptions are ambiguous or multiple semantically similar objects coexist. To tackle this, we formalize the *personalized few-shot localization* task: given a small set of annotated context images, the model must localize the same object category in a novel query image. Methodologically, we introduce (i) the first dedicated benchmark for this task; (ii) pseudo-name label regularization, which suppresses language priors and strengthens reliance on visual context; and (iii) context-aware instruction tuning data derived from video object tracking sequences. Extensive experiments across VLMs ranging from 7B to 72B parameters demonstrate consistent, significant improvements over state-of-the-art methods on multiple custom benchmarks. Our work is the first to systematically identify and bridge the critical gap in context-driven visual localization capability within modern VLMs.

📝 Abstract
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA) when explicitly trained for these tasks. Despite these advances, we find that present-day VLMs (including the proprietary GPT-4o) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context. In this work, we focus on the task of few-shot personalized localization, where a model is given a small set of annotated images (in-context examples) -- each with a category label and bounding box -- and is tasked with localizing the same object type in a query image. Personalized localization can be particularly important when several related objects match the same textual description, or when an object is hard to describe in words. To provoke personalized localization abilities in models, we present a data-centric solution that fine-tunes them using carefully curated data from video object tracking datasets. By leveraging sequences of frames tracking the same object across multiple shots, we simulate instruction-tuning dialogues that promote context awareness. To reinforce this, we introduce a novel regularization technique that replaces object labels with pseudo-names, ensuring the model relies on visual context rather than prior knowledge. Our method significantly enhances the few-shot localization performance of recent VLMs ranging from 7B to 72B in size, without sacrificing generalization, as demonstrated on several benchmarks tailored towards evaluating personalized localization abilities. This work is the first to explore and benchmark personalized few-shot localization for VLMs -- exposing critical weaknesses in present-day VLMs, and laying a foundation for future research in context-driven vision-language applications.
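The data construction the abstract describes -- turning a video object tracking sequence into a context-aware instruction-tuning dialogue -- might be sketched as follows. The dialogue format, image placeholder syntax, and helper name are illustrative assumptions, not the paper's actual implementation:

```python
def build_fewshot_dialogue(frames, boxes, label, n_context=3):
    """Simulate a few-shot localization dialogue from a tracking sequence.

    Because tracking follows the same object across frames, the first
    `n_context` frames can serve as annotated in-context examples and the
    next frame becomes the query. `frames` are image identifiers, `boxes`
    are [x1, y1, x2, y2] boxes for the tracked object in each frame.
    (Hypothetical sketch: the <image:...> placeholder and message schema
    are assumed, not taken from the paper.)
    """
    dialogue = []
    for img, box in zip(frames[:n_context], boxes[:n_context]):
        dialogue.append({"role": "user",
                         "content": f"<image:{img}> This is a {label}."})
        dialogue.append({"role": "assistant",
                         "content": f"The {label} is at {box}."})
    # The frame after the context window becomes the query turn.
    dialogue.append({"role": "user",
                     "content": f"<image:{frames[n_context]}> Locate the {label}."})
    # Supervision target: the box in the query frame.
    target = {"role": "assistant",
              "content": f"The {label} is at {boxes[n_context]}."}
    return dialogue, target
```

In this sketch, `label` would be the tracked object's category name; the paper additionally replaces it with a pseudo-name during training so the model cannot shortcut via language priors.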
Problem

Research questions and friction points this paper is trying to address.

VLMs lack the ability to localize objects using visual context.
Few-shot personalized localization from a small set of annotated in-context images.
Enhancing VLMs' context awareness for object localization.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes VLMs with video object tracking data
Uses pseudo-names to replace object labels
Simulates instruction-tuning dialogues for context awareness
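The pseudo-name regularization idea above could be sketched like this: consistently swap the object's real category name for a meaningless token throughout a training dialogue, so the model must ground the name in the visual examples rather than in prior knowledge. The function name and random-string pseudo-name scheme are assumptions for illustration, not the paper's exact procedure:

```python
import random
import string

def pseudonymize_dialogue(turns, label):
    """Replace a real category label (e.g. "dog") with a random pseudo-name
    in every turn of an instruction-tuning dialogue.

    Consistency matters: the same pseudo-name is used across all turns, so
    the in-context examples still define what the query refers to, while
    the language prior attached to the real label is suppressed.
    (Hypothetical sketch of the regularization idea.)
    """
    pseudo = label
    # Regenerate until the pseudo-name does not accidentally contain the label.
    while label in pseudo:
        pseudo = "".join(random.choices(string.ascii_lowercase, k=6))
    return [turn.replace(label, pseudo) for turn in turns], pseudo
```

A plausible design choice here is drawing a fresh pseudo-name per dialogue, so the model never learns a fixed mapping from any token to a category.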
Sivan Doveh
Weizmann Institute of Science; Google
Nimrod Shabtay
Tel Aviv University
Wei Lin
JKU Linz
Eli Schwartz
IBM Research
Hildegard Kuehne
Tuebingen AI Center
Raja Giryes
Professor, Tel Aviv University
Visual Language Models, Signal and Image Processing, Generative AI, Deep Learning
Rogerio Feris
Research Manager, MIT-IBM Watson AI Lab
Computer Vision, Machine Learning, Artificial Intelligence
Leonid Karlinsky
Principal Research Scientist, MIT-IBM Watson AI Lab, IBM Research
Computer Vision
James R. Glass
MIT CSAIL
Assaf Arbelle
IBM Research
S. Ullman
Weizmann Institute of Science
M. Jehanzeb Mirza
MIT CSAIL