X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models

📅 2024-12-02
🏛️ arXiv.org
📈 Citations: 9
Influential: 0
🤖 AI Summary
While autoregressive vision-language models (VLMs) excel at text-to-image generation, their in-context learning (ICL) capability for general-purpose image generation has not been systematically explored. Method: the paper proposes the first unified, purely autoregressive VLM framework for ICL-based general image generation. It comprises (1) a context-aware feature-compression encoder that substantially extends the effective context length and improves cross-task generalization, and (2) an LLM-inspired scalable architecture that models text and image autoregression jointly under a shared text-image prediction objective. Results: the method achieves state-of-the-art performance on multiple seen ICL tasks and generalizes zero-shot to unseen image generation tasks, without fine-tuning or task-specific architectural modifications. This work provides the first empirical validation that purely autoregressive VLMs can support flexible, context-driven, general-purpose image generation in an end-to-end, plug-and-play ICL paradigm.

📝 Abstract
In-context generation is a key component of large language models' (LLMs) open-task generalization capability. By leveraging a few examples as context, LLMs can perform both in-domain and out-of-domain tasks. Recent auto-regressive vision-language models (VLMs) built upon LLMs have shown impressive performance in text-to-image generation. However, the potential of in-context learning for general image generation tasks remains largely unexplored. To address this, we introduce X-Prompt, a purely auto-regressive large vision-language model designed to deliver competitive performance across a wide range of both seen and unseen image generation tasks, all within a unified in-context learning framework. X-Prompt incorporates a specialized design that efficiently compresses valuable features from in-context examples, supporting longer in-context token sequences and improving generalization to unseen tasks. A unified training task for both text and image prediction enables X-Prompt to handle general image generation with enhanced task awareness drawn from in-context examples. Extensive experiments validate the model's performance across diverse seen image generation tasks and its capacity to generalize to previously unseen tasks.
Problem

Research questions and friction points this paper is trying to address.

Enabling universal in-context image generation tasks
Improving generalization to unseen image generation challenges
Enhancing auto-regressive vision models' task awareness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Auto-regressive vision-language model for universal tasks
Efficient compression of in-context example features
Unified training for text and image prediction
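The "unified training for text and image prediction" idea can be illustrated with a minimal sketch: text tokens and discrete image tokens share one vocabulary (disjoint id ranges) and are trained with a single next-token cross-entropy loss over the interleaved sequence. This is an illustrative assumption-laden toy in NumPy, not the authors' implementation; the vocabulary sizes and token ids are made up.

```python
import numpy as np

# Illustrative sketch (NOT the paper's code): a shared vocabulary in which
# ids 0..99 are text tokens and ids 100..149 are discrete image tokens.
# Sizes are arbitrary assumptions for demonstration.
TEXT_VOCAB = 100
IMAGE_VOCAB = 50
VOCAB = TEXT_VOCAB + IMAGE_VOCAB

def next_token_loss(logits, targets):
    """Mean cross-entropy of predicting targets[t] from logits[t].

    logits: (T, VOCAB) unnormalized scores; targets: (T,) token ids.
    The same objective covers both text and image positions, which is
    the essence of a joint text-image prediction task.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
# An interleaved sequence: a few text tokens followed by image tokens,
# as might appear after flattening an in-context example.
sequence = np.array([3, 17, 42, 105, 131, 148])
logits = rng.normal(size=(len(sequence), VOCAB))
loss = next_token_loss(logits, sequence)
assert loss > 0.0  # untrained (random) logits give a positive loss
```

Because one loss covers every position, the model needs no separate image-generation head: conditioning on in-context examples and producing image tokens both reduce to ordinary next-token prediction.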
👥 Authors

Zeyi Sun
Shanghai Jiao Tong University, Shanghai AI Laboratory

Ziyang Chu
Shanghai AI Laboratory, Tsinghua University

Pan Zhang
Shanghai AI Laboratory

Tong Wu
The Chinese University of Hong Kong, Shanghai AI Laboratory

Xiao-wen Dong
Shanghai AI Laboratory

Yuhang Zang
Shanghai AI Laboratory
Natural Language Processing, Vision Language Model

Yuanjun Xiong
Adobe Firefly
Computer Vision, Pattern Recognition, Machine Learning

Dahua Lin
The Chinese University of Hong Kong
Computer Vision, Machine Learning, Probabilistic Inference, Bayesian Nonparametrics

Jiaqi Wang
Shanghai AI Laboratory