AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

📅 2024-02-19
🏛️ Annual Meeting of the Association for Computational Linguistics
📈 Citations: 158
Influential: 11
🤖 AI Summary
This work addresses the challenge of enabling arbitrary cross-modal generation and understanding, spanning speech, text, images, and music, without modifying LLM architectures or training paradigms. Methodologically, it introduces a unified multimodal modeling framework grounded in purely data-level discrete representations (tokens produced by modality-specific tokenizers for speech, images, and music), constructs a text-centric cross-modal alignment pretraining dataset, and synthesizes the first 108k-sample any-to-any instruction-tuning dataset. All modalities are mapped to discrete token sequences and processed by a standard Transformer. Experiments demonstrate that the resulting model matches or approaches state-of-the-art specialized models across diverse multimodal tasks and supports end-to-end multimodal dialogue with arbitrary input-output modality combinations, validating discrete sequence modeling as an effective, concise, and scalable unifying interface for multimodal AI.
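The data-level interface described above is simple enough to sketch. Below is a minimal Python sketch, assuming hypothetical vocabulary sizes, codebook sizes, and boundary-token names (the paper's actual tokenizers and special tokens differ): each modality's discrete codebook ids are shifted into a disjoint range appended to the text vocabulary, bracketed by start/end tokens, and interleaved with text so an unmodified decoder-only Transformer sees a single token stream.

```python
# Minimal sketch of the data-level unification described above. Vocabulary
# sizes, codebook sizes, and the <modality>...</modality> boundary tokens are
# illustrative assumptions, not the paper's exact values or token names.

TEXT_VOCAB_SIZE = 32_000                                    # base LLM vocabulary
CODEBOOKS = {"image": 8192, "speech": 1024, "music": 2048}  # per-modality sizes

# Give each modality a pair of boundary tokens plus a disjoint id range
# appended after the text vocabulary.
SPECIAL, OFFSETS, next_id = {}, {}, TEXT_VOCAB_SIZE
for modality, size in CODEBOOKS.items():
    SPECIAL[f"<{modality}>"] = next_id        # span-open token
    SPECIAL[f"</{modality}>"] = next_id + 1   # span-close token
    OFFSETS[modality] = next_id + 2           # first codebook id in shared vocab
    next_id += 2 + size

def modality_span(modality: str, codebook_ids: list[int]) -> list[int]:
    """Shift a tokenizer's raw codebook ids into the shared vocabulary
    and bracket them with the modality's boundary tokens."""
    base = OFFSETS[modality]
    return [SPECIAL[f"<{modality}>"],
            *(base + i for i in codebook_ids),
            SPECIAL[f"</{modality}>"]]

# One interleaved training sequence: text ids stay unchanged, other modalities
# appear as bracketed discrete spans, and a standard Transformer trains on the
# stream with its usual next-token objective.
text_ids = [17, 934, 205]                     # placeholder text token ids
sequence = (text_ids
            + modality_span("image", [12, 4051, 77])
            + modality_span("speech", [3, 250]))
print(sequence)
```

Because all modality-specific choices live in this preprocessing step, adding a new modality amounts to registering one more tokenizer and id range, which is the sense in which the abstract compares it to incorporating a new language.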

📝 Abstract
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, facilitating the seamless integration of new modalities into LLMs, akin to the incorporation of new languages. We build a multimodal text-centric dataset for multimodal alignment pre-training. Utilizing generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, thus equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, proving that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are shown at https://junzhan2000.github.io/AnyGPT.github.io/
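To make the instruction data concrete, here is a hedged illustration of what one multi-turn, any-to-any sample could look like once every modality is reduced to discrete tokens; the field names and placeholder tags are assumptions for exposition, not the released dataset's actual schema.

```python
# Hypothetical shape of one sample from an any-to-any instruction dataset:
# user and assistant turns freely interleave text with bracketed spans of
# discrete modality tokens. Schema and tags are illustrative assumptions.
sample = {
    "conversation": [
        {"role": "user",
         "content": "Turn this hummed melody into a short piano piece and "
                    "give me cover art for it. <music>...music tokens...</music>"},
        {"role": "assistant",
         "content": "Here is the piano arrangement and a matching cover. "
                    "<music>...music tokens...</music> "
                    "<image>...image tokens...</image>"},
    ]
}
```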
Problem

Research questions and friction points this paper is trying to address.

Unifying multimodal processing with discrete sequence modeling
Enabling any-to-any modality conversion without architectural changes
Handling arbitrary combinations of speech, text, images, and music
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete representations unify multimodal processing
Data-level preprocessing enables stable LLM training
Synthetic 108k-sample dataset facilitates any-to-any conversations (see the generation-side sketch after this list)
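On the output side, any-to-any generation reduces to routing the generated token stream back through the modality decoders. The sketch below continues the hypothetical vocabulary layout from the first snippet (boundary tokens and offsets are passed in as arguments); the decoder objects are stand-ins, not the paper's actual de-tokenizers.

```python
# Generation-side sketch: cut each bracketed modality span out of the LLM's
# mixed output, shift ids back to raw codebook ids, and hand them to that
# modality's decoder to synthesize the image/speech/music. SPECIAL and
# OFFSETS follow the hypothetical layout from the earlier snippet.

def extract_spans(ids: list[int], special: dict[str, int],
                  offsets: dict[str, int]) -> list[tuple[str, list[int]]]:
    """Split a generated id sequence into (modality, raw codebook ids) spans,
    skipping plain text tokens."""
    open_tokens = {special[f"<{m}>"]: m for m in offsets}
    spans, i = [], 0
    while i < len(ids):
        modality = open_tokens.get(ids[i])
        if modality is None:                  # ordinary text token
            i += 1
            continue
        close = special[f"</{modality}>"]
        j = i + 1
        while j < len(ids) and ids[j] != close:
            j += 1
        base = offsets[modality]
        spans.append((modality, [t - base for t in ids[i + 1:j]]))
        i = j + 1
    return spans

# Usage (decoders are hypothetical stand-ins for the modality de-tokenizers):
# for modality, codes in extract_spans(generated_ids, SPECIAL, OFFSETS):
#     DECODERS[modality].decode(codes)        # e.g. image or speech decoder
```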
🔎 Similar Papers
No similar papers found.
Authors

Jun Zhan
Fudan University
Junqi Dai
Fudan University
Jiasheng Ye
Fudan University
Large Language Models, Generative Models, AI Scientists
Yunhua Zhou
Fudan University
Machine Learning, Natural Language Processing
Dong Zhang
Fudan University
Zhigeng Liu
Fudan University
Xin Zhang
Fudan University
Ruibin Yuan
HKUST
Artificial Intelligence, Music Generation, Music Information Retrieval, Computer Music
Ge Zhang
Multimodal Art Projection Research Community
Linyang Li
Fudan University
Hang Yan
Shanghai AI Laboratory
Jie Fu
Multimodal Art Projection Research Community
Tao Gui
Fudan University
Tianxiang Sun
Fudan University
Yu-Gang Jiang
Professor, Fudan University. IEEE & IAPR Fellow
Video Analysis, Embodied AI, Trustworthy AI
Xipeng Qiu
Fudan University