Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Image understanding and generation impose distinct requirements on visual representations and training paradigms, which limits the performance of unified multimodal models. Method: Pisces is an autoregressive multimodal foundation model with a decoupled dual-path vision encoder that separately models understanding-oriented and generation-oriented visual representations, combined with training techniques tailored to multimodal generation, including careful data curation, pretraining, and finetuning. Contribution/Results: The model achieves competitive performance across more than 20 image understanding benchmarks and robust generation quality on GenEval. The accompanying analysis reveals a synergistic relationship between understanding and generation capabilities, supporting decoupled visual encoders over single-encoder architectures for unified multimodal models.
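The decoupled dual-path idea above can be sketched minimally: one encoder path produces compact semantic features for understanding, a second preserves reconstruction-friendly detail for generation, and the task determines which path feeds the autoregressive backbone. This is a toy illustration of the routing concept, not the paper's implementation; the names `SemanticEncoder`, `GenerativeEncoder`, and `route_features` are hypothetical, and the "encoders" are placeholder arithmetic.

```python
from typing import List


class SemanticEncoder:
    """Stand-in for an understanding-oriented encoder (e.g. a CLIP-style model).

    Hypothetical placeholder: mean-pools the input into a fixed-size
    semantic summary, discarding pixel-level detail.
    """

    def encode(self, image: List[float]) -> List[float]:
        mean = sum(image) / len(image)
        return [mean] * 4  # fixed-size "semantic" embedding


class GenerativeEncoder:
    """Stand-in for a generation-oriented encoder (e.g. a VAE/VQ tokenizer).

    Hypothetical placeholder: coarsely quantizes each value, keeping
    per-position detail needed for reconstruction.
    """

    def encode(self, image: List[float]) -> List[float]:
        return [round(x, 1) for x in image]


def route_features(image: List[float], task: str) -> List[float]:
    """Select the encoder path by task, as in a decoupled architecture."""
    if task == "understanding":
        return SemanticEncoder().encode(image)
    if task == "generation":
        return GenerativeEncoder().encode(image)
    raise ValueError(f"unknown task: {task}")
```

The point of the decoupling is that each path can be optimized for its own objective (semantic alignment vs. reconstruction fidelity) instead of forcing one representation to serve both.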

📝 Abstract
Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image understanding versus generation, as well as the distinct training processes required for each modality. In this work, we introduce Pisces, an auto-regressive multimodal foundation model that addresses this challenge through a novel decoupled visual encoding architecture and tailored training techniques optimized for multimodal generation. Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation. We evaluate Pisces on over 20 public benchmarks for image understanding, where it demonstrates strong performance across a wide range of tasks. Additionally, on GenEval, a widely adopted benchmark for image generation, Pisces exhibits robust generative capabilities. Our extensive analysis reveals the synergistic relationship between image understanding and generation, and the benefits of using separate visual encoders, advancing the field of unified multimodal models.
Problem

Research questions and friction points this paper is trying to address.

Bridging the performance gap between unified and specialized multimodal models
Resolving the conflict between the visual features needed for image understanding versus generation
Developing joint training techniques for multimodal foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled visual encoding architecture
Tailored multimodal generation training
Meticulous data curation and pretraining