AI Summary
Existing multimodal models struggle to jointly support understanding, generation, and editing, while suffering from low efficiency on high-resolution inputs and from excessive autoregressive decoding steps. To address these limitations, this paper proposes OneCAT, a unified decoder-only multimodal model. Its key contributions are: (1) eliminating Vision Transformers and visual tokenizers at inference time by directly modeling raw pixel sequences; (2) incorporating modality-specific Mixture-of-Experts (MoE) layers and multi-scale visual autoregressive modeling to support dynamic-resolution inputs; and (3) unifying understanding, generation, and editing in a single end-to-end training framework under one shared autoregressive objective. Experiments show that OneCAT consistently outperforms leading open-source multimodal models across all three task categories while requiring significantly fewer decoding steps, setting a new state of the art.
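The modality-specific MoE idea in contribution (2) can be sketched as hard routing: every token carries a modality tag, attention is shared, but each token's feed-forward pass goes through the expert for its modality. The sketch below is illustrative only; the expert shapes, tags, and random weights are assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16

def make_ffn():
    """One expert: a two-layer MLP with placeholder random weights."""
    return (rng.standard_normal((d_model, d_ff)) * 0.1,
            rng.standard_normal((d_ff, d_model)) * 0.1)

def ffn(x, weights):
    w1, w2 = weights
    return np.maximum(x @ w1, 0.0) @ w2  # ReLU MLP

# One expert per modality; the attention layers (not shown) would be shared.
experts = {"text": make_ffn(), "vision": make_ffn()}

def modality_moe(tokens, modalities):
    """Route each token through the FFN expert matching its modality tag."""
    out = np.empty_like(tokens)
    for name, weights in experts.items():
        mask = np.array([m == name for m in modalities])
        if mask.any():
            out[mask] = ffn(tokens[mask], weights)
    return out

seq = rng.standard_normal((5, d_model))
tags = ["text", "text", "vision", "vision", "text"]
mixed = modality_moe(seq, tags)
print(mixed.shape)  # (5, 8)
```

Because routing is deterministic by modality rather than learned per token, there is no load-balancing loss; each expert simply specializes in its input type.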
Abstract
We introduce OneCAT, a unified multimodal model that seamlessly integrates understanding, generation, and editing within a novel, pure decoder-only transformer architecture. Our framework uniquely eliminates the need for external components such as Vision Transformers (ViTs) or vision tokenizers during inference, leading to significant efficiency gains, especially for high-resolution inputs. This is achieved through a modality-specific Mixture-of-Experts (MoE) structure trained with a single autoregressive (AR) objective, which also natively supports dynamic resolutions. Furthermore, we pioneer a multi-scale visual autoregressive mechanism within the Large Language Model (LLM) that drastically reduces decoding steps compared to diffusion-based methods while maintaining state-of-the-art performance. Our findings demonstrate the powerful potential of pure autoregressive modeling as a sufficient and elegant foundation for unified multimodal intelligence. As a result, OneCAT sets a new performance standard, outperforming existing open-source unified multimodal models across benchmarks for multimodal generation, editing, and understanding.
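The decoding-step reduction claimed for the multi-scale visual AR mechanism is easy to see with a back-of-the-envelope count: a classic raster-order AR model emits one visual token per step, whereas a next-scale scheme emits all tokens of a scale in parallel, so the step count equals the number of scales. The scale schedule below is a hypothetical example, not the paper's actual configuration.

```python
# Hypothetical coarse-to-fine schedule: side lengths of the token maps.
scales = [1, 2, 4, 8, 16]

tokens_per_scale = [s * s for s in scales]
total_tokens = sum(tokens_per_scale)      # 1 + 4 + 16 + 64 + 256 = 341

ar_steps = total_tokens      # raster-order AR: one token per decoding step
multiscale_steps = len(scales)  # next-scale AR: one step per scale

print(ar_steps, multiscale_steps)  # 341 5
```

Under this schedule, multi-scale decoding needs 5 steps instead of 341, and the gap widens as the finest scale grows, since the final scale dominates the token count.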