E-InMeMo: Enhanced Prompting for Visual In-Context Learning

📅 2025-04-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Visual in-context learning (ICL) suffers from strong dependence on prompt quality and limited generalization. To address this, we propose a learnable perturbation-augmented prompt optimization framework: for the first time, we introduce learnable image perturbations into visual context pairs and perform end-to-end differentiable optimization of input-output image pairings to enhance zero-shot inference on query images—without fine-tuning model parameters, ensuring lightweight efficiency. Our core contribution lies in formulating prompt optimization as a differentiable perturbation learning problem that jointly preserves semantic consistency and discriminability. Extensive experiments demonstrate significant improvements: +7.99 mIoU on foreground segmentation and +17.04 AP on single-object detection, substantially outperforming existing visual ICL approaches. These results validate the method’s effectiveness and generalizability in few-shot vision tasks.

Technology Category

Application Category

📝 Abstract
Large-scale models trained on extensive datasets have become the standard due to their strong generalizability across diverse tasks. In-context learning (ICL), widely used in natural language processing, leverages these models by providing task-specific prompts without modifying their parameters. This paradigm is increasingly being adapted for computer vision, where models receive an input-output image pair, known as an in-context pair, alongside a query image to illustrate the desired output. However, the success of visual ICL largely hinges on the quality of these prompts. To address this, we propose Enhanced Instruct Me More (E-InMeMo), a novel approach that incorporates learnable perturbations into in-context pairs to optimize prompting. Through extensive experiments on standard vision tasks, E-InMeMo demonstrates superior performance over existing state-of-the-art methods. Notably, it improves mIoU scores by 7.99 for foreground segmentation and by 17.04 for single object detection when compared to the baseline without learnable prompts. These results highlight E-InMeMo as a lightweight yet effective strategy for enhancing visual ICL. Code is publicly available at: https://github.com/Jackieam/E-InMeMo
Problem

Research questions and friction points this paper is trying to address.

Enhancing visual in-context learning via optimized prompting
Improving performance in vision tasks with learnable perturbations
Boosting mIoU scores for segmentation and object detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses learnable perturbations for better prompts
Enhances visual in-context learning performance
Improves segmentation and object detection scores
J
Jiahao Zhang
D3 Center, The University of Osaka, Osaka, 565-0871, Japan
B
Bowen Wang
D3 Center, The University of Osaka, Osaka, 565-0871, Japan
H
Hong Liu
School of Informatics, Xiamen University, Xiamen, 361000, China
L
Liangzhi Li
Meetyou AI Lab, Xiamen Meet You Co., Ltd, Xiamen, 361000, China
Yuta Nakashima
Yuta Nakashima
SANKEN, The University of Osaka
Computer VisionPattern RecognitionNatural Language Processing
Hajime Nagahara
Hajime Nagahara
Professor of Osaka University
Computational PhotographyComputer Vision