From Static to Interactive: Adapting Visual In-Context Learners for User-Driven Tasks

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-based in-context learning models struggle to respond to user interaction signals—such as scribbles, clicks, or bounding boxes—limiting their applicability in user-guided tasks. This work proposes Interactive DeLVM, which for the first time natively integrates diverse interaction signals directly into visual in-context examples. Without modifying the DeLVM architecture or requiring fine-tuning, the method endows the model with the ability to respond in real time to personalized user guidance. Experimental results demonstrate substantial improvements over existing approaches: a 7.95% increase in IoU on interactive segmentation, a 2.46 dB gain in PSNR for directed super-resolution, and a 3.14% reduction in LPIPS for interactive object removal.
📝 Abstract
Visual in-context learning models are designed to adapt to new tasks by leveraging a set of example input-output pairs, enabling rapid generalization without task-specific fine-tuning. However, these models operate in a fundamentally static paradigm: while they can adapt to new tasks, they lack any mechanism to incorporate user-provided guidance signals such as scribbles, clicks, or bounding boxes to steer or refine the prediction process. This limitation is particularly restrictive in real-world applications, where users want to actively guide model predictions, e.g., by highlighting the target object for segmentation, indicating a region that should be visually altered, or isolating a specific person in a complex scene to run targeted pose estimation. In this work, we propose a simple method to transform static visual in-context learners, particularly the DeLVM approach, into highly controllable, user-driven systems, i.e., Interactive DeLVM, enabling seamless interaction through natural visual cues such as scribbles, clicks, or bounding boxes. Specifically, by encoding interactions directly into the example input-output pairs, we keep the philosophy of visual in-context learning intact: enabling users to prompt models with unseen interactions without fine-tuning and empowering them to dynamically steer model predictions with personalized interactions. Our experiments demonstrate that SOTA visual in-context learning models fail to effectively leverage interaction cues, often ignoring user guidance entirely. In contrast, our method excels in controllable, user-guided scenarios, achieving improvements of +7.95% IoU for interactive segmentation, +2.46 dB PSNR for directed super-resolution, and -3.14% LPIPS for interactive object removal. With this, our work bridges the gap between rigid static task adaptation and fluid interactivity for user-centric visual in-context learning.
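The abstract's central idea is that the interaction is rendered into the prompt images rather than into the model: example input-output pairs already carry the user's clicks, scribbles, or boxes, and the frozen in-context learner simply sees annotated images. The sketch below illustrates one way such an encoding could look. It is a minimal illustration under our own assumptions (NumPy arrays for images and masks); all names such as `rasterize_interaction` and `build_interactive_prompt` are hypothetical and not taken from the paper's code.

```python
# Minimal sketch (assumption: not the authors' released code) of encoding user
# interactions directly into visual in-context examples. The downstream model
# stays frozen; only the prompt images change, so no fine-tuning is needed.

import numpy as np


def rasterize_interaction(shape, clicks=(), boxes=(), scribble=None):
    """Render clicks, boxes, and an optional scribble mask into a single
    binary guidance map with the same spatial size as the image."""
    h, w = shape
    guide = np.zeros((h, w), dtype=np.float32)
    for (y, x) in clicks:                      # point prompts as small squares
        guide[max(0, y - 2):y + 3, max(0, x - 2):x + 3] = 1.0
    for (y0, x0, y1, x1) in boxes:             # box prompts drawn as outlines
        guide[y0:y1, x0] = 1.0
        guide[y0:y1, x1 - 1] = 1.0
        guide[y0, x0:x1] = 1.0
        guide[y1 - 1, x0:x1] = 1.0
    if scribble is not None:                   # free-form scribble mask
        guide = np.maximum(guide, scribble.astype(np.float32))
    return guide


def overlay_interaction(image, guide, color=(255, 0, 0), alpha=0.6):
    """Blend the guidance map into the RGB image so the interaction becomes
    part of the visual prompt itself (one hypothetical encoding choice)."""
    out = image.astype(np.float32).copy()
    mask = guide > 0
    out[mask] = (1 - alpha) * out[mask] + alpha * np.array(color, np.float32)
    return out.astype(np.uint8)


def build_interactive_prompt(examples, query_image, query_guide):
    """Assemble in-context pairs whose inputs already carry interactions,
    followed by the annotated query. `examples` is a list of
    (input_image, interaction_guide, target_image) triples."""
    prompt = []
    for img, guide, target in examples:
        prompt.append((overlay_interaction(img, guide), target))
    prompt.append((overlay_interaction(query_image, query_guide), None))
    return prompt  # fed unchanged to a frozen visual in-context learner
```

Because the guidance lives entirely in pixel space, any interaction that can be rasterized (a new gesture, a lasso, a coarse mask) can in principle be prompted the same way without retraining, which is the property the abstract attributes to keeping the in-context-learning philosophy intact.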
Problem

Research questions and friction points this paper is trying to address.

visual in-context learning
user interaction
interactive guidance
static paradigm
user-driven tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

interactive visual in-context learning
user-guided adaptation
DeLVM
visual prompting
scribble-based interaction