🤖 AI Summary
While autoregressive vision-language models (VLMs) excel at text-to-image generation, their in-context learning (ICL) capability for general-purpose image generation remains largely unexplored.
Method: We propose the first unified, purely autoregressive VLM framework for ICL-based general image generation. It comprises: (1) a context-aware feature compression encoder that significantly extends effective context length and enhances cross-task generalization; and (2) an LLM-inspired scalable architecture that jointly models text and image autoregression under a shared text–image joint prediction objective.
Results: Our method achieves state-of-the-art performance on multiple seen ICL tasks and demonstrates strong zero-shot generalization to unseen image generation tasks—without fine-tuning or task-specific architectural modifications. This work provides the first empirical validation that purely autoregressive VLMs can support flexible, context-driven, general-purpose image generation in an end-to-end, plug-and-play ICL paradigm.
📝 Abstract
In-context generation is a key component of large language models' (LLMs) open-task generalization capability. By leveraging a few examples as context, LLMs can perform both in-domain and out-of-domain tasks. Recent advancements in autoregressive vision-language models (VLMs) built upon LLMs have showcased impressive performance in text-to-image generation. However, the potential of in-context learning for general image generation tasks remains largely unexplored. To address this, we introduce X-Prompt, a purely autoregressive large vision-language model designed to deliver competitive performance across a wide range of both seen and unseen image generation tasks, all within a unified in-context learning framework. X-Prompt incorporates a specialized design that efficiently compresses valuable features from in-context examples, supporting longer in-context token sequences and improving its ability to generalize to unseen tasks. A unified training task for both text and image prediction enables X-Prompt to handle general image generation with enhanced task awareness from in-context examples. Extensive experiments validate the model's performance across diverse seen image generation tasks and its capacity to generalize to previously unseen tasks.
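The two ingredients above — compressing in-context example tokens and training one next-token objective over a shared text–image vocabulary — can be illustrated with a toy sketch. Everything here is an assumption for illustration only: the function names (`compress_context`, `unified_nll`), the vocabulary sizes, the strided-subsampling "compression", and the random logits are stand-ins, not the paper's actual architecture or API.

```python
# Illustrative sketch (NOT the paper's implementation) of a unified
# next-token objective over an interleaved text + image token sequence,
# with in-context examples compressed to fewer positions.
import numpy as np

TEXT_VOCAB = 1000           # assumed: text token ids occupy [0, 1000)
IMAGE_VOCAB = 4096          # assumed: image codebook ids occupy [1000, 5096)
UNIFIED_VOCAB = TEXT_VOCAB + IMAGE_VOCAB

def compress_context(example_tokens: np.ndarray, factor: int = 4) -> np.ndarray:
    """Toy stand-in for a context-compression encoder: keep every
    `factor`-th token so an in-context example occupies fewer positions,
    freeing effective context length for more examples."""
    return example_tokens[::factor]

def unified_nll(logits: np.ndarray, targets: np.ndarray) -> float:
    """Next-token cross-entropy over the shared text+image vocabulary:
    position t predicts token t+1, regardless of modality."""
    shifted = logits[:-1]
    probs = np.exp(shifted - shifted.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return float(-np.mean(np.log(probs[np.arange(len(targets) - 1), targets[1:]])))

# Interleaved sequence: compressed in-context example image, then the
# query text prompt, then the image tokens to be generated.
rng = np.random.default_rng(0)
example = rng.integers(TEXT_VOCAB, UNIFIED_VOCAB, size=256)     # context image tokens
query_text = rng.integers(0, TEXT_VOCAB, size=16)               # text prompt tokens
target_image = rng.integers(TEXT_VOCAB, UNIFIED_VOCAB, size=64) # target image tokens

sequence = np.concatenate([compress_context(example), query_text, target_image])
logits = rng.normal(size=(len(sequence), UNIFIED_VOCAB))        # dummy model output
loss = unified_nll(logits, sequence)
```

The point of the sketch is the single loss: text and image positions share one vocabulary and one autoregressive objective, so task awareness comes entirely from the (compressed) in-context tokens rather than task-specific heads.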