FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations

📅 2025-04-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Most image encoders produce a single fixed, generic feature vector per image, which limits how well they adapt to downstream tasks that need different visual information. This paper proposes FocalLens, presented here as the first framework to bring natural-language instruction tuning into visual encoding, so that a single image can yield different task-directed, context-aware representations in a zero-shot setting. Starting from a pretrained vision encoder, FocalLens is contrastively fine-tuned on vision instruction tuning data so that it accepts an instruction as an additional input. On the SugarCrepe and MMVP-VLM benchmarks it achieves average gains of 5 and 10 points, respectively, and it also improves image-image retrieval, image-text retrieval, and image classification. The core contribution is a move away from static features toward visual representations that can be controlled through language.

📝 Abstract

Visual understanding is inherently contextual: what we focus on in an image depends on the task at hand. For instance, given an image of a person holding a bouquet of flowers, we may focus on either the person (for example, their clothing) or the type of flowers, depending on the context of interest. Yet most existing image encoding paradigms represent an image as a fixed, generic feature vector, overlooking the potential need to prioritize different visual information for different downstream use cases. In this work, we introduce FocalLens, a conditional visual encoding method that produces different representations for the same image based on the context of interest, expressed flexibly through natural language. We leverage vision instruction tuning data and contrastively finetune a pretrained vision encoder to take natural language instructions as additional inputs for producing conditional image representations. Extensive experiments validate that conditional image representations from FocalLens better pronounce the visual features of interest compared to generic features produced by standard vision encoders like CLIP. In addition, we show that FocalLens leads to performance improvements on a range of downstream tasks, including image-image retrieval, image classification, and image-text retrieval, with average gains of 5 and 10 points on the challenging SugarCrepe and MMVP-VLM benchmarks, respectively.

Problem

Research questions and friction points this paper is trying to address.

Enables zero-shot conditional image representations via instruction tuning

Addresses fixed generic feature vectors in image encoding paradigms

Improves downstream tasks with context-aware visual feature prioritization
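
To make the idea of context-aware feature prioritization concrete, here is a toy sketch of conditional image-image retrieval. It is not FocalLens: the learned, instruction-conditioned encoder is replaced by a hypothetical hand-written weight vector (`condition_weights`) that selects which feature dimensions matter, and the 4-dimensional "image features" are invented for illustration. It only shows how the nearest neighbour of the same query image can change with the condition.

```python
import numpy as np

def toy_conditional_encode(features, condition_weights):
    """Toy stand-in for a conditional encoder (an assumption, not the paper's model).

    Reweights feature dimensions by the condition, then L2-normalizes.
    """
    z = np.asarray(features, dtype=float) * np.asarray(condition_weights, dtype=float)
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def retrieve(query, gallery, condition_weights):
    """Return the index of the gallery item most similar to the query under the condition."""
    q = toy_conditional_encode(query, condition_weights)
    g = toy_conditional_encode(gallery, condition_weights)
    return int(np.argmax(g @ q))  # cosine similarity, since rows are unit length

# Invented features: dims 0-1 describe clothing, dims 2-3 describe the flowers.
query = np.array([1.0, 0.0, 1.0, 0.0])       # red jacket, roses
gallery = np.array([
    [1.0, 0.0, 0.0, 1.0],                    # red jacket, tulips
    [0.0, 1.0, 1.0, 0.0],                    # blue coat, roses
])

focus_on_clothing = np.array([1.0, 1.0, 0.0, 0.0])
focus_on_flowers = np.array([0.0, 0.0, 1.0, 1.0])

clothing_match = retrieve(query, gallery, focus_on_clothing)
flower_match = retrieve(query, gallery, focus_on_flowers)
```

With these invented values, conditioning on clothing retrieves the first gallery image and conditioning on flowers retrieves the second. A fixed, unconditioned feature vector would have to pick one ranking for both use cases.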

Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditional image encoding via natural language instructions

Vision instruction tuning for contextual representations

Contrastive fine-tuning of pretrained vision encoder
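
The contrastive fine-tuning step can be sketched as follows. The source does not spell out the exact objective, so this assumes a standard one-directional InfoNCE loss in which each conditional image embedding (image plus instruction) is pulled toward the embedding of its paired target text and pushed away from the other targets in the batch. The function names, the temperature value, and the choice of target are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def info_nce_loss(cond_image_emb, target_emb, temperature=0.07):
    """One-directional InfoNCE over a batch (assumed objective).

    cond_image_emb: (B, D) embeddings of (image, instruction) pairs.
    target_emb:     (B, D) embeddings of the paired target texts.
    Row i of each array is a positive pair; all other rows are negatives.
    """
    a = l2_normalize(np.asarray(cond_image_emb, dtype=float))
    b = l2_normalize(np.asarray(target_emb, dtype=float))
    logits = a @ b.T / temperature                      # (B, B) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True) # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))          # cross-entropy on the diagonal

# Sanity check with random data: correctly paired rows give a lower loss
# than the same rows paired with the wrong targets.
rng = np.random.default_rng(0)
targets = rng.normal(size=(8, 16))
aligned_loss = info_nce_loss(targets, targets)
shuffled_loss = info_nce_loss(targets, np.roll(targets, 1, axis=0))
```

In an actual training loop the embeddings would come from the vision encoder and a text encoder, and the loss would be minimized with an automatic-differentiation framework rather than evaluated in NumPy.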
