Personalized Vision via Visual In-Context Learning

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision models rely heavily on large-scale labeled datasets and excel only on predefined tasks, exhibiting limited generalization to user-specified objects or tasks at inference time. To address this, we propose PICO, the first framework enabling tuning-free, cross-task (recognition and generation), open-ended personalized visual reasoning. Its core is diffusion Transformer-based visual in-context learning: a four-panel layout explicitly models multi-granularity visual relationships, while an attention-guided seed scoring mechanism improves reliability through efficient inference-time scaling. We introduce VisRel, a compact tuning dataset supporting diverse personalized vision tasks. Experiments demonstrate that PICO substantially outperforms fine-tuning and synthetic-data baselines across zero-shot task transfer, novel-category recognition, and controllable image generation, highlighting its strong generalization capability and flexibility in real-world personalized scenarios.
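To make the four-panel layout concrete, below is a minimal sketch of how such an in-context prompt could be assembled, assuming a 2x2 grid where the top row carries the annotated exemplar pair and the bottom-right panel is left blank for the diffusion transformer to fill in. The panel size, grid order, and function name are illustrative assumptions, not details taken from the paper.

```python
from PIL import Image

PANEL = 512  # assumed side length of each panel

def build_four_panel_prompt(exemplar_in: Image.Image,
                            exemplar_out: Image.Image,
                            query_in: Image.Image) -> Image.Image:
    """Assemble a 2x2 in-context canvas:
       [exemplar input | exemplar output]
       [query input    | blank panel to be generated]"""
    canvas = Image.new("RGB", (2 * PANEL, 2 * PANEL), color=(127, 127, 127))
    canvas.paste(exemplar_in.resize((PANEL, PANEL)), (0, 0))
    canvas.paste(exemplar_out.resize((PANEL, PANEL)), (PANEL, 0))
    canvas.paste(query_in.resize((PANEL, PANEL)), (0, PANEL))
    # The bottom-right panel stays blank; a diffusion transformer would
    # inpaint it, transferring the exemplar's transformation to the query.
    return canvas
```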

📝 Abstract
Modern vision models, trained on large-scale annotated datasets, excel at predefined tasks but struggle with personalized vision -- tasks defined at test time by users with customized objects or novel objectives. Existing personalization approaches rely on costly fine-tuning or synthetic data pipelines, which are inflexible and restricted to fixed task formats. Visual in-context learning (ICL) offers a promising alternative, yet prior methods are confined to narrow, in-domain tasks and fail to generalize to open-ended personalization. We introduce Personalized In-Context Operator (PICO), a simple four-panel framework that repurposes diffusion transformers as visual in-context learners. Given a single annotated exemplar, PICO infers the underlying transformation and applies it to new inputs without retraining. To enable this, we construct VisRel, a compact yet diverse tuning dataset, showing that task diversity, rather than scale, drives robust generalization. We further propose an attention-guided seed scorer that improves reliability via efficient inference scaling. Extensive experiments demonstrate that PICO (i) surpasses fine-tuning and synthetic-data baselines, (ii) flexibly adapts to novel user-defined tasks, and (iii) generalizes across both recognition and generation.
Problem

Research questions and friction points this paper is trying to address.

Addressing personalized vision tasks without retraining models
Overcoming limitations of fine-tuning and synthetic data methods
Enabling flexible adaptation to user-defined visual transformations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses four-panel framework for visual in-context learning
Constructs diverse dataset to enable robust generalization
Employs attention-guided seed scorer for reliable inference scaling (see the sketch after this list)
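As a rough illustration of seed-level inference scaling, the sketch below samples several random seeds, scores each candidate generation, and keeps the highest-scoring one. The paper's scorer is attention-guided; here the scoring signal is left as a placeholder (for example, attention mass on the target panel), and both function names are hypothetical.

```python
from typing import Any, Callable, Iterable, Tuple

def select_best_seed(generate_fn: Callable[[int], Any],
                     score_fn: Callable[[Any], float],
                     seeds: Iterable[int] = range(8)) -> Tuple[int, Any]:
    """Generate one candidate per seed and return (best_seed, best_output)."""
    best_seed, best_out, best_score = -1, None, float("-inf")
    for seed in seeds:
        out = generate_fn(seed)    # one four-panel in-context generation pass
        score = score_fn(out)      # e.g., attention-guided quality estimate
        if score > best_score:
            best_seed, best_out, best_score = seed, out, score
    return best_seed, best_out
```

In practice, widening the seed pool trades extra inference passes for reliability, which is the point of scoring at inference time rather than retraining.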
Yuxin Jiang
Show Lab, National University of Singapore
Yuchao Gu
National University of Singapore
Generative Models, Visual Generation, Multi-Modal Generation
Yiren Song
Ph.D. student, National University of Singapore
Generative AI, Diffusion, Unified Model
Ivor Tsang
A*STAR, Singapore
Mike Zheng Shou
Show Lab, National University of Singapore