🤖 AI Summary
Visual in-context learning (ICL) depends heavily on prompt quality, which limits how well it generalizes. To address this, we propose E-InMeMo, a prompt optimization framework that adds learnable perturbations to the in-context image pair and optimizes them end to end to improve predictions on query images. The model's parameters are not fine-tuned, which keeps the method lightweight. Our core contribution lies in formulating prompt optimization as a differentiable perturbation learning problem that jointly preserves semantic consistency and discriminability. Experiments show gains of 7.99 mIoU on foreground segmentation and 17.04 mIoU on single-object detection over the baseline without learnable prompts, and E-InMeMo outperforms existing state-of-the-art visual ICL methods. These results support the method as an effective, lightweight strategy for visual ICL.
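The idea of learning a perturbation on the prompt while the model stays frozen can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the "frozen model" is a fixed linear scorer rather than a vision model, the prompt is a short vector rather than an image pair, and the names `frozen_model`, `learn_perturbation`, `eps`, and `lr` are hypothetical. Only the perturbation `delta` is updated by gradient descent.

```python
# Toy illustration only: a fixed linear scorer stands in for the frozen
# vision model, and a short vector stands in for the in-context image pair.

def frozen_model(prompt, query, w_prompt, w_query):
    """Hypothetical frozen model: its weights are never updated."""
    score_p = sum(w * p for w, p in zip(w_prompt, prompt))
    score_q = sum(w * q for w, q in zip(w_query, query))
    return score_p + score_q

def mean_loss(prompt, delta, data, w_prompt, w_query):
    """Mean squared error over (query, target) pairs with a perturbed prompt."""
    perturbed = [p + d for p, d in zip(prompt, delta)]
    errs = [frozen_model(perturbed, q, w_prompt, w_query) - t for q, t in data]
    return sum(e * e for e in errs) / len(errs)

def learn_perturbation(prompt, data, w_prompt, w_query,
                       eps=0.5, lr=0.05, steps=200):
    """Gradient descent on delta only; delta is clamped to [-eps, eps]."""
    delta = [0.0] * len(prompt)
    for _ in range(steps):
        perturbed = [p + d for p, d in zip(prompt, delta)]
        grad = [0.0] * len(prompt)
        for query, target in data:
            err = frozen_model(perturbed, query, w_prompt, w_query) - target
            # d(err^2)/d(delta_i) = 2 * err * w_prompt[i] for a linear model
            for i, w in enumerate(w_prompt):
                grad[i] += 2.0 * err * w / len(data)
        delta = [max(-eps, min(eps, d - lr * g)) for d, g in zip(delta, grad)]
    return delta

# Made-up numbers for the demonstration.
w_prompt = [0.5, -0.3, 0.8, 0.1]
w_query = [0.2, 0.4]
prompt = [0.1, 0.2, 0.3, 0.4]
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.2), ([1.0, 1.0], 1.4)]

weights_before = (list(w_prompt), list(w_query))
zero = [0.0] * len(prompt)
loss_before = mean_loss(prompt, zero, data, w_prompt, w_query)
delta = learn_perturbation(prompt, data, w_prompt, w_query)
loss_after = mean_loss(prompt, delta, data, w_prompt, w_query)
print(loss_before, loss_after)
```

In this toy setting the loss drops while `w_prompt` and `w_query` stay untouched, which mirrors the claim that only the prompt is optimized. The actual method operates on images and a large pretrained model, so this should be read as an analogy, not as the authors' implementation.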
📝 Abstract
Large-scale models trained on extensive datasets have become the standard due to their strong generalizability across diverse tasks. In-context learning (ICL), widely used in natural language processing, leverages these models by providing task-specific prompts without modifying their parameters. This paradigm is increasingly being adapted for computer vision, where a model receives a query image alongside an input-output image pair, known as an in-context pair, that illustrates the desired output. However, the success of visual ICL largely hinges on the quality of these prompts. To address this, we propose Enhanced Instruct Me More (E-InMeMo), a novel approach that incorporates learnable perturbations into in-context pairs to optimize prompting. Through extensive experiments on standard vision tasks, E-InMeMo demonstrates superior performance over existing state-of-the-art methods. Notably, it improves mIoU scores by 7.99 for foreground segmentation and by 17.04 for single object detection compared to the baseline without learnable prompts. These results highlight E-InMeMo as a lightweight yet effective strategy for enhancing visual ICL. Code is publicly available at: https://github.com/Jackieam/E-InMeMo
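For readers unfamiliar with the metric behind the reported gains, the following is a minimal sketch of mean intersection over union (mIoU) for binary masks. It uses the generic definition; the paper's exact evaluation protocol (for example, how scores are averaged across splits) is not stated in the abstract, and the function names and the convention for two empty masks are assumptions.

```python
def iou(pred, target):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0

def mean_iou(preds, targets):
    """Average IoU over a set of mask pairs, reported on a 0-100 scale."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return 100.0 * sum(scores) / len(scores)

# Two made-up 2x2 examples: IoU 0.5 and IoU 1.0.
preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
targets = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]
score = mean_iou(preds, targets)
print(score)
```

On this scale, an improvement of 7.99 means the average overlap between predicted and ground-truth masks rises by 7.99 percentage points.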