Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

📅 2024-12-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the lack of a unified modeling paradigm in multimodal intelligence by proposing a next-generation multimodal learning framework centered on the unified objective of Next-Token Prediction (NTP). Methodologically, it introduces the first five-dimensional NTP taxonomy—encompassing multimodal tokenization, model architecture, task representation, datasets, and evaluation benchmarks—and integrates cross-modal tokenization, sequence-based modeling, unified prompt engineering, and multimodal benchmark construction. Key contributions include: (1) shifting multimodal learning from task-specific paradigms toward a unified objective-driven framework; (2) releasing the first open-source repository (on GitHub) dedicated to multimodal NTP, including curated literature and reproducible code; and (3) providing a systematic theoretical foundation and practical, reproducible guidelines to advance research on multimodal large language models.

📝 Abstract
Building on the foundations of language modeling in natural language processing, Next Token Prediction (NTP) has evolved into a versatile training objective for machine learning tasks across various modalities, achieving considerable success. As Large Language Models (LLMs) have advanced to unify understanding and generation tasks within the textual modality, recent research has shown that tasks from different modalities can also be effectively encapsulated within the NTP framework, transforming multimodal information into tokens and predicting the next one given the context. This survey introduces a comprehensive taxonomy that unifies both understanding and generation within multimodal learning through the lens of NTP. The proposed taxonomy covers five key aspects: multimodal tokenization, MMNTP model architectures, unified task representation, datasets & evaluation, and open challenges. This new taxonomy aims to aid researchers in their exploration of multimodal intelligence. An associated GitHub repository collecting the latest papers and repos is available at https://github.com/LMM101/Awesome-Multimodal-Next-Token-Prediction
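The NTP objective the abstract describes reduces to shifted cross-entropy: at each position, the model's logits are scored against the token that actually comes next, regardless of whether those tokens encode text, image patches, or audio frames. The sketch below is a minimal, dependency-free illustration of that loss (not code from the surveyed paper); the token ids and logits are hypothetical toy values.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ntp_loss(logits_per_step, tokens):
    """Average cross-entropy of predicting token t+1 from the
    logits at step t -- the next-token-prediction objective."""
    total = 0.0
    for t in range(len(tokens) - 1):
        probs = softmax(logits_per_step[t])
        total += -math.log(probs[tokens[t + 1]])
    return total / (len(tokens) - 1)

# Toy example: a 4-token vocabulary (which could mix text and
# image-patch tokens), a 3-token sequence, and made-up logits
# emitted at the first two positions.
tokens = [0, 2, 1]
logits = [
    [0.1, 0.2, 3.0, 0.0],  # position 0: mass on token 2 (the true next token)
    [0.0, 2.5, 0.1, 0.3],  # position 1: mass on token 1 (the true next token)
]
print(round(ntp_loss(logits, tokens), 4))
```

Because the loss only ever sees token ids, swapping the text tokenizer for a visual or audio tokenizer leaves this objective unchanged, which is the unification the survey's taxonomy is built around.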
Problem

Research questions and friction points this paper is trying to address.

Next Word Prediction
Multimodal Information
Artificial Intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Next Token Prediction
Large Language Models
Multi-modal Intelligence
Liang Chen
Peking University
Zekun Wang
Beihang University
Shuhuai Ren
Peking University
Deep Learning · Natural Language Processing
Lei Li
University of Hong Kong
Haozhe Zhao
Peking University
Yunshui Li
Seed Team | Prev. Qwen, SIAT
Natural Language Processing · Multimodal (Vision-and-Language) Representation Learning
Zefan Cai
Student, Peking University
Inference Acceleration · Multi-Modality
Hongcheng Guo
School of Data Science, Fudan University
LLMs · Multimodal LLMs
Lei Zhang
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Yizhe Xiong
Tsinghua University
Transfer Learning · Computer Vision · Large Language Models
Yichi Zhang
Peking University
Ruoyu Wu
Peking University
Qingxiu Dong
Peking University
Natural Language Processing · Machine Learning
Ge Zhang
M-A-P
Jian Yang
Alibaba Group
Lingwei Meng
ByteDance; The Chinese University of Hong Kong
Speech and Language Processing · Speech Recognition · Speech Synthesis
Shujie Hu
The Chinese University of Hong Kong
Speech Processing · MLLM
Yulong Chen
MD Anderson Cancer Center
Lung Cancer · Cancer Biology
Junyang Lin
Qwen Team, Alibaba Group & Peking University
Natural Language Processing · Cross-Modal Representation Learning · Pretraining
Shuai Bai
Qwen Team, Alibaba Group
Multi-Modal Learning · Visual Generation
Andreas Vlachos
University of Cambridge
Xu Tan
Microsoft Research
Minjia Zhang
University of Illinois at Urbana-Champaign
Parallelism · Machine Learning Systems · Model Compression · LLM Application
Wen Xiao
Microsoft Research
Aaron Yee
Humanify Inc., Zhejiang University
Tianyu Liu
Alibaba Group
Baobao Chang
Peking University