🤖 AI Summary
Visual in-context learning (VICL) suffers from prediction bias and instability due to its reliance on a single context example. To address this, we propose PANICL, a training-free, general-purpose visual in-context learning framework. Our method introduces image-patch-level k-nearest-neighbor retrieval to enable multi-example similarity matching, and integrates dynamic weighting with feature-space alignment to smoothly fuse assignment scores across multiple context examples. This design effectively mitigates single-example bias, substantially improving prediction stability as well as cross-task and cross-model generalization. Extensive experiments demonstrate that PANICL consistently outperforms strong baselines across diverse vision tasks, including foreground segmentation, object detection, colorization, and keypoint detection, while maintaining robustness under dataset shift and label-space changes. These results validate its generality, scalability, and practical applicability in real-world vision systems.
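The core idea, patch-level k-NN retrieval followed by weighted score fusion, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `panicl_fuse`, the tensor shapes, and the softmax-temperature weighting scheme are illustrative assumptions, intended only to show how smoothing scores over the k nearest context examples reduces reliance on any single one.

```python
import numpy as np

def panicl_fuse(query_patches, context_patches, context_scores, k=3, tau=0.1):
    """Hypothetical sketch of patch-level k-NN score fusion.

    query_patches:   (P, D)    patch features of the query image
    context_patches: (N, P, D) patch features of N candidate context examples
    context_scores:  (N, C)    per-context prediction (assignment) scores
    Returns a (C,) fused score vector.
    """
    # Cosine-normalize patch features so dot products are similarities.
    q = query_patches / np.linalg.norm(query_patches, axis=-1, keepdims=True)
    c = context_patches / np.linalg.norm(context_patches, axis=-1, keepdims=True)

    # Patch-level similarity: sim[n, p, r] compares query patch p with
    # patch r of context example n.
    sim = np.einsum('pd,nrd->npr', q, c)            # (N, P, P)
    # Score each context by its best-matching patch, averaged over query patches.
    per_context = sim.max(axis=-1).mean(axis=-1)    # (N,)

    # Keep the k nearest contexts and weight them with a temperature softmax.
    topk = np.argsort(per_context)[-k:]
    w = np.exp(per_context[topk] / tau)
    w /= w.sum()

    # Smooth the assignment scores across the selected contexts.
    return w @ context_scores[topk]
```

Because the weights form a convex combination, the fused scores stay within the range spanned by the selected contexts, which is what dampens the bias of any single in-context pair.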
📝 Abstract
Visual In-Context Learning (VICL) uses input-output image pairs, referred to as in-context pairs (or examples), as prompts alongside query images to guide models in performing diverse vision tasks. However, VICL often suffers from over-reliance on a single in-context pair, which can lead to biased and unstable predictions. We introduce PAtch-based $k$-Nearest neighbor visual In-Context Learning (PANICL), a general training-free framework that mitigates this issue by leveraging multiple in-context pairs. PANICL smooths assignment scores across pairs, reducing bias without requiring additional training. Extensive experiments on a variety of tasks, including foreground segmentation, single object detection, colorization, multi-object segmentation, and keypoint detection, demonstrate consistent improvements over strong baselines. Moreover, PANICL exhibits strong robustness to domain shifts, including dataset-level shift (e.g., from COCO to Pascal) and label-space shift (e.g., FSS-1000), and generalizes well to other VICL models such as SegGPT, Painter, and LVM, highlighting its versatility and broad applicability.