IIR-VLM: In-Context Instance-level Recognition for Large Vision-Language Models

📅 2026-01-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge that large vision-language models struggle to distinguish individual instances, such as specific people or objects, which limits their applicability in personalized scenarios. To overcome this limitation, the authors integrate a pre-trained instance-level recognition expert model as an auxiliary visual encoder that supplies specialized features to the vision-language model. This enables in-context learning of novel instances from a single example and fine-grained, instance-aware understanding, without requiring extensive instance-specific data or additional training. The approach is the first to enable in-context one-shot instance-level recognition in large vision-language models, and it supports cross-category instance perception. Evaluated on both existing and newly curated multi-category benchmarks covering faces, persons, pets, and general objects, the method significantly outperforms current state-of-the-art approaches.
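The summary describes the auxiliary-encoder design only at a high level; the paper's actual fusion mechanism is not given here. Below is a minimal PyTorch sketch of one plausible reading, in which features from a frozen ILR expert (e.g., a re-identification model) are projected into the VLM's token space and appended to its visual tokens. The module name, dimensions, and projector architecture are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AuxiliaryILRFusion(nn.Module):
    """Hypothetical fusion module: project a frozen ILR expert's embedding
    into the VLM's hidden size and append it to the visual token sequence.
    Dimensions and architecture are assumptions, not the paper's design."""

    def __init__(self, ilr_dim: int = 512, vlm_dim: int = 4096):
        super().__init__()
        # A small MLP projector, in the spirit of common VLM adapters.
        self.projector = nn.Sequential(
            nn.Linear(ilr_dim, vlm_dim),
            nn.GELU(),
            nn.Linear(vlm_dim, vlm_dim),
        )

    def forward(self, vlm_tokens: torch.Tensor, ilr_embedding: torch.Tensor) -> torch.Tensor:
        # vlm_tokens: (batch, num_tokens, vlm_dim) from the VLM's own vision encoder
        # ilr_embedding: (batch, ilr_dim) from the frozen ILR expert (e.g., a re-ID model)
        ilr_token = self.projector(ilr_embedding).unsqueeze(1)  # (batch, 1, vlm_dim)
        return torch.cat([vlm_tokens, ilr_token], dim=1)        # identity-aware token sequence
```

Keeping the expert encoder frozen and training only a small projector is one way to reconcile the summary's claim that new instances require no additional training, though the paper may fuse the features differently.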

📝 Abstract
Instance-level recognition (ILR) concerns distinguishing individual instances from one another, with person re-identification as a prominent example. Despite the impressive visual perception capabilities of modern VLMs, we find their performance on ILR unsatisfactory, often dramatically underperforming domain-specific ILR models. This limitation hinders many practical applications of VLMs, e.g., those where recognizing familiar people and objects is crucial for effective visual understanding. Existing solutions typically learn to recognize instances one at a time using instance-specific datasets, which not only incur substantial data collection and training costs but also struggle with fine-grained discrimination. In this work, we propose IIR-VLM, a VLM enhanced for In-context Instance-level Recognition. We integrate pre-trained ILR expert models as auxiliary visual encoders to provide specialized features for learning diverse instances, enabling VLMs to learn new instances in-context in a one-shot manner. Further, IIR-VLM leverages this knowledge for instance-aware visual understanding. We validate IIR-VLM's efficacy on existing instance personalization benchmarks. Finally, we demonstrate its superior ILR performance on a challenging new benchmark that assesses ILR capabilities across varying difficulty levels and diverse categories, with persons, faces, pets, and general objects as the instances of interest.
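The abstract states that a single reference example suffices to recognize a new instance in-context, but not how matching is performed. As a hedged sketch, the simplest baseline consistent with that protocol compares the ILR expert's embedding of the one reference image against candidate instance embeddings from a query image by cosine similarity; the function name and threshold are hypothetical, not from the paper.

```python
import torch
import torch.nn.functional as F

def one_shot_instance_match(
    ref_embedding: torch.Tensor,     # (dim,) expert embedding of the single reference image
    query_embeddings: torch.Tensor,  # (num_candidates, dim) embeddings of detected instances
    threshold: float = 0.6,          # illustrative value, not from the paper
):
    """Return (index, score) of the best-matching candidate, or (None, score)
    if no candidate clears the similarity threshold."""
    sims = F.cosine_similarity(query_embeddings, ref_embedding.unsqueeze(0), dim=-1)
    best = int(sims.argmax())
    if sims[best].item() >= threshold:
        return best, sims[best].item()
    return None, sims[best].item()

# Example with random vectors standing in for expert features:
ref = F.normalize(torch.randn(512), dim=0)
candidates = F.normalize(torch.randn(5, 512), dim=-1)
print(one_shot_instance_match(ref, candidates))
```

IIR-VLM reportedly goes beyond such embedding matching by feeding the expert features into the VLM for instance-aware understanding, so this snippet illustrates only the one-shot recognition step, not the full system.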
Problem

Research questions and friction points this paper is trying to address.

Instance-level Recognition
Vision-Language Models
Person Re-identification
In-context Learning
Visual Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-context Learning
Instance-level Recognition
Vision-Language Models
One-shot Learning
Auxiliary Visual Encoder
Authors

Liang Shi (Northeastern University)
Wei Li (Wyze Labs, Inc.)
Kevin M. Beussman (Wyze Labs, Inc.)
Lin Chen (Chief Scientist @ Wyze)
Yun Fu (Northeastern University)