E-InMeMo: Enhanced Prompting for Visual In-Context Learning

📅 2025-04-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Visual in-context learning (ICL) suffers from strong dependence on prompt quality and limited generalization. To address this, E-InMeMo adds learnable perturbations to the in-context image pairs and optimizes them end to end, improving inference on query images without fine-tuning the model's parameters, which keeps the method lightweight. The core contribution is to formulate prompt optimization as a differentiable perturbation-learning problem that jointly preserves semantic consistency and discriminability. Compared with the baseline without learnable prompts, the method improves mIoU by 7.99 on foreground segmentation and by 17.04 on single-object detection, and it outperforms existing state-of-the-art visual ICL approaches. These results support the method's effectiveness as a lightweight strategy for visual ICL.

📝 Abstract

Large-scale models trained on extensive datasets have become the standard due to their strong generalizability across diverse tasks. In-context learning (ICL), widely used in natural language processing, leverages these models by providing task-specific prompts without modifying their parameters. This paradigm is increasingly being adapted for computer vision, where models receive an input-output image pair, known as an in-context pair, alongside a query image to illustrate the desired output. However, the success of visual ICL largely hinges on the quality of these prompts. To address this, we propose Enhanced Instruct Me More (E-InMeMo), a novel approach that incorporates learnable perturbations into in-context pairs to optimize prompting. Through extensive experiments on standard vision tasks, E-InMeMo demonstrates superior performance over existing state-of-the-art methods. Notably, it improves mIoU scores by 7.99 for foreground segmentation and by 17.04 for single object detection when compared to the baseline without learnable prompts. These results highlight E-InMeMo as a lightweight yet effective strategy for enhancing visual ICL. Code is publicly available at: https://github.com/Jackieam/E-InMeMo
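
The idea of learning a perturbation on the prompt while the model stays frozen can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the E-InMeMo code: the "model" is a fixed linear map (`FROZEN_WEIGHTS`, `QUERY_WEIGHT`), the in-context pair is a short feature vector, the loss is a squared error on a single example, and the `eps` bound on the perturbation is a hypothetical choice. The only point it shows is that gradients update the perturbation and never the model weights.

```python
from typing import List, Tuple

# Hypothetical stand-in for a frozen pretrained model: these weights are
# constants and are never updated during prompt optimization.
FROZEN_WEIGHTS = [0.5, -0.25, 0.75]
QUERY_WEIGHT = 0.1


def frozen_model(prompt: List[float], query: float) -> float:
    """Toy frozen model: a fixed linear map over the prompt and the query."""
    return sum(w * p for w, p in zip(FROZEN_WEIGHTS, prompt)) + QUERY_WEIGHT * query


def learn_perturbation(
    prompt: List[float],
    query: float,
    target: float,
    steps: int = 100,
    lr: float = 0.1,
    eps: float = 0.5,
) -> Tuple[List[float], float]:
    """Learn an additive perturbation on the prompt; return (delta, final loss)."""
    delta = [0.0] * len(prompt)
    for _ in range(steps):
        perturbed = [p + d for p, d in zip(prompt, delta)]
        error = frozen_model(perturbed, query) - target
        # Gradient of the squared error with respect to delta_i is 2 * error * w_i.
        grads = [2.0 * error * w for w in FROZEN_WEIGHTS]
        # Gradient step on the perturbation only, then clip to the assumed bound.
        delta = [max(-eps, min(eps, d - lr * g)) for d, g in zip(delta, grads)]
    perturbed = [p + d for p, d in zip(prompt, delta)]
    final_loss = (frozen_model(perturbed, query) - target) ** 2
    return delta, final_loss


prompt = [1.0, 2.0, 3.0]  # toy features standing in for an in-context pair
query = 1.0               # toy feature standing in for the query image
target = 2.6              # toy ground-truth output for the query

loss_before = (frozen_model(prompt, query) - target) ** 2
delta, loss_after = learn_perturbation(prompt, query, target)
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.2e}")
```

In an actual visual ICL setting the perturbation would be an image-shaped tensor trained over many examples with an automatic-differentiation framework; the hand-derived gradient above is only practical because the toy model is linear.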

Problem

Research questions and friction points this paper is trying to address.

Enhancing visual in-context learning via optimized prompting

Improving performance in vision tasks with learnable perturbations

Boosting mIoU scores for segmentation and object detection
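
For readers unfamiliar with the metric, mIoU (mean intersection over union) can be sketched for binary masks as below. This is a minimal illustration under stated assumptions, not the paper's evaluation script: the helper names `iou` and `mean_iou` are hypothetical, masks are plain nested lists of 0/1, the mean is taken over images, and two empty masks are treated as a perfect match. The function returns a fraction in [0, 1]; scores such as those quoted here are typically reported on a 0 to 100 scale.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 = foreground, 0 = background


def iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union of two binary masks of the same shape."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(preds: List[Mask], truths: List[Mask]) -> float:
    """Average the per-image IoU over a list of predictions."""
    scores = [iou(p, t) for p, t in zip(preds, truths)]
    return sum(scores) / len(scores)


preds = [[[1, 1], [0, 0]], [[1, 0], [0, 1]]]
truths = [[[1, 0], [0, 0]], [[1, 0], [0, 1]]]
print(mean_iou(preds, truths))
```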

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses learnable perturbations for better prompts

Enhances visual in-context learning performance

Improves segmentation and object detection scores

🔎 Similar Papers

Cropper: Vision-Language Model for Image Cropping through In-Context Learning