GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

📅 2026-04-29
📈 Citations: 0
Influential: 0

🤖 AI Summary
Current foundation models lack native capabilities for perceiving and acting upon multimodal contexts such as images, videos, and web pages, which hinders the development of truly multimodal agents. This work proposes a native foundation model that treats multimodal perception as a core component, rather than a peripheral interface, by integrating it into reasoning, planning, tool use, and execution. The model is developed through improvements in model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. It reports strong performance in multimodal coding, visual tool use, and framework-based agentic tasks while preserving competitive text-only coding capability, and it offers practical insights for building multimodal agents.
📝 Abstract
We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive text-only coding capability. More importantly, our development process offers practical insights for building multimodal agents, highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.
Problem

Research questions and friction points this paper is trying to address.

multimodal agents
foundation model
multimodal perception
agentic capability
heterogeneous contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal agents
native foundation model
multimodal perception
tool use
hierarchical optimization
Wenyi Hong
Tsinghua University
multimodal pretraining
Xiaotao Gu
Zhipu AI
Language Modeling, Generative Models, Data Mining
Ziyang Pan
Z.ai & Tsinghua University
Zhen Yang
Tsinghua University
Large Language Model, Graph Representation Learning, Negative Sampling, Recommendation
Yuting Wang
Z.ai & Tsinghua University
Yue Wang
Z.ai & Tsinghua University
Yuanchang Yue
Z.ai & Tsinghua University
Yu Wang
Shanghai Jiao Tong University & Shanghai AI Laboratory
Natural Language Processing, Speech and Language Processing, Large Language Model
Yanling Wang
Zhipu AI
Data Mining, Natural Language Processing
Yan Wang
Tsinghua University; SenseTime
Neural Compression, Computer Vision, Machine Learning
Xijun Liu
Z.ai & Tsinghua University
Wenmeng Yu
Tsinghua University
Natural Language Processing, Multimodal Learning, Facial Expression Recognition
Weihan Wang
Z.ai
Multimodal learning, LLM
Wei Li
Northeastern University | MIT | Tsinghua
Applied mechanics, battery crash safety, scientific machine learning
Shuaiqi Duan
Z.ai & Tsinghua University
Sheng Yang
Z.ai & Tsinghua University
Ruiliang Lv
Z.ai & Tsinghua University
Mingdao Liu
PhD Student, Tsinghua University
Natural Language Processing
Lihang Pan
Z.ai & Tsinghua University
Ke Ning
Z.ai & Tsinghua University
Junhui Ji
Z.ai & Tsinghua University
Jinjiang Wang
Z.ai & Tsinghua University
Jing Chen
Tsinghua University; Algorand Inc; Stony Brook University; IAS; MIT
Game theory and mechanism design, distributed ledgers, theory of computation