Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads

📅 2024-11-28
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
This work addresses information loss and modeling complexity in multimodal interleaved generation arising from joint modeling of discrete text and continuous image features. Methodologically: (1) it introduces modality-specific autoregressive heads—namely, a language modeling (LM) head for text and a diffusion-based head for images—enabling end-to-end joint generation; (2) it replaces hard vector quantization with soft vector quantization (Soft VQ) to preserve the continuity of image features; and (3) it directly models continuous image representations, thereby avoiding reconstruction artifacts inherent in vector-quantized (VQ) approaches. Evaluated at the 7B-parameter scale, the model achieves GenEval = 0.58 and MME-P = 1265.8, substantially outperforming Show-o and Chameleon. Notably, it supports high-fidelity, long-horizon interleaved text-image generation. The proposed framework establishes an efficient, low-distortion autoregressive paradigm for foundational multimodal modeling.
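The soft-versus-hard vector quantization contrast described above can be sketched in a few lines. This is a minimal illustrative toy, not the paper's actual Soft VQ module: the 2-D codebook, squared-distance metric, and temperature are assumptions made purely to show why a softmax-weighted mixture preserves continuity where nearest-neighbor snapping loses it.

```python
import math

def hard_vq(feature, codebook):
    """Hard VQ: snap the feature to its single nearest codebook entry (lossy)."""
    dists = [sum((f - c) ** 2 for f, c in zip(feature, code)) for code in codebook]
    return codebook[dists.index(min(dists))]

def soft_vq(feature, codebook, temperature=1.0):
    """Soft VQ: softmax-weighted mixture of codebook entries.

    The output varies smoothly with the input feature, so no information
    is discarded by a hard nearest-neighbor assignment.
    """
    dists = [sum((f - c) ** 2 for f, c in zip(feature, code)) for code in codebook]
    weights = [math.exp(-d / temperature) for d in dists]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(feature)
    return [sum(w * code[i] for w, code in zip(weights, codebook)) for i in range(dim)]

# A feature between two codes: hard VQ collapses it onto one code,
# soft VQ returns an intermediate point that still reflects its position.
codebook = [[0.0, 0.0], [1.0, 1.0]]
snapped = hard_vq([0.4, 0.4], codebook)   # exactly one codebook entry
blended = soft_vq([0.4, 0.4], codebook)   # lies between the two entries
```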

📝 Abstract
We introduce Orthus, an autoregressive (AR) transformer that excels in generating images given textual prompts, answering questions based on visual inputs, and even crafting lengthy image-text interleaved content. Unlike prior art on unified multimodal modeling, Orthus simultaneously copes with discrete text tokens and continuous image features under the AR modeling principle. The continuous treatment of visual signals minimizes the information loss for both image understanding and generation, while the fully AR formulation renders the characterization of the correlation between modalities straightforward. The key mechanism enabling Orthus to leverage these advantages lies in its modality-specific heads -- one regular language modeling (LM) head predicts discrete text tokens and one diffusion head generates continuous image features conditioned on the output of the backbone. We devise an efficient strategy for building Orthus -- by substituting the Vector Quantization (VQ) operation in the existing unified AR model with a soft alternative, introducing a diffusion head, and tuning the added modules to reconstruct images, we can create an Orthus-base model effortlessly (e.g., within a mere 72 A100 GPU hours). Orthus-base can further embrace post-training to better model interleaved images and texts. Empirically, Orthus surpasses competing baselines including Show-o and Chameleon across standard benchmarks, achieving a GenEval score of 0.58 and an MME-P score of 1265.8 using 7B parameters. Orthus also shows exceptional mixed-modality generation capabilities, reflecting the potential for handling intricate practical generation tasks.
Problem

Research questions and friction points this paper is trying to address.

Generating images from textual prompts efficiently
Answering questions based on visual inputs accurately
Creating interleaved image-text content seamlessly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive transformer with modality-specific heads
Diffusion head for continuous image features
Efficient soft VQ replacement strategy
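The modality-specific heads can be sketched as a simple dispatch on top of a shared backbone's hidden state. This is a toy illustration, not Orthus's implementation: the random weights, dimensions, and the collapse of the diffusion head into a single linear conditioning map are all assumptions for clarity; in the paper the diffusion head runs an actual denoising process over continuous image features.

```python
import random

random.seed(0)
HIDDEN, VOCAB, IMG_DIM = 8, 16, 4

# Hypothetical toy weights standing in for trained parameters.
lm_head = [[random.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in range(VOCAB)]
diff_head = [[random.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in range(IMG_DIM)]

def matvec(matrix, vec):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def route(hidden_state, modality):
    """Dispatch the shared backbone's hidden state to a modality-specific head.

    Text positions go through the LM head (logits over a discrete vocabulary);
    image positions go through the diffusion head, reduced here to one linear
    map producing a continuous conditioning vector.
    """
    if modality == "text":
        return matvec(lm_head, hidden_state)    # logits over VOCAB tokens
    return matvec(diff_head, hidden_state)      # condition for the diffusion step

# One backbone state, two heads: discrete logits vs. a continuous vector.
h = [random.gauss(0, 1) for _ in range(HIDDEN)]
text_out = route(h, "text")     # length-VOCAB logit vector
image_out = route(h, "image")   # length-IMG_DIM continuous vector
```

The point of the design is that both heads condition on the same AR backbone output, so cross-modal correlations are modeled once in the backbone while each head keeps its native output space (discrete tokens vs. continuous features).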
Siqi Kou
Shanghai Jiao Tong University (Machine Learning)
Jiachun Jin
Qing Yuan Research Institute, SEIEE, Shanghai Jiao Tong University
Chang Liu
Qing Yuan Research Institute, SEIEE, Shanghai Jiao Tong University
Ye Ma
Kuaishou Technology
Jian Jia
Institute of Automation, Chinese Academy of Sciences (CASIA) (computer vision)
Quan Chen
Kuaishou Technology
Peng Jiang
Kuaishou Technology
Zhijie Deng
Qing Yuan Research Institute, SEIEE, Shanghai Jiao Tong University