GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

šŸ“… 2025-07-01
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
Existing vision-language models (VLMs) lack unified, strong multimodal reasoning across diverse tasks, including STEM problem solving, video understanding, code generation, GUI interaction, and long-document comprehension, particularly at small parameter scales. Method: We first build a capable vision foundation model through large-scale pre-training, then apply Reinforcement Learning with Curriculum Sampling (RLCS), a training paradigm that progressively schedules tasks during policy optimization to unlock the model's reasoning potential. Contribution/Results: The resulting GLM-4.1V-9B-Thinking achieves state-of-the-art performance among open-source models of comparable size across 28 public benchmarks, outperforms Qwen2.5-VL-7B on nearly all tasks, matches or surpasses the far larger Qwen2.5-VL-72B on 18 benchmarks, and reaches GPT-4o-level performance on STEM reasoning and long-document understanding, demonstrating robust cross-modal reasoning in a sub-10B-parameter VLM.

šŸ“ Abstract
We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. Reinforcement Learning with Curriculum Sampling (RLCS) then unlocks the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding, among others. To facilitate research in this field, we open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document understanding and STEM reasoning, further underscoring its strong capabilities. Code, models and more information are released at https://github.com/THUDM/GLM-4.1V-Thinking.
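The abstract names Reinforcement Learning with Curriculum Sampling (RLCS) as the stage that unlocks the pre-trained model's capabilities, but this page gives no implementation detail. The sketch below is a minimal, hedged illustration of one plausible reading of curriculum sampling over task families, not the authors' method: the CurriculumSampler class, its success-rate weighting, and the task names are assumptions introduced here for illustration. The idea shown is to draw RL rollouts more often from task categories whose recent accuracy is intermediate, on the assumption that those carry the most training signal.

```python
import random
from collections import defaultdict, deque


class CurriculumSampler:
    """Hypothetical sketch of curriculum sampling for RL rollouts.

    Assumption: tasks whose recent success rate is neither near 0 nor
    near 1 carry the most learning signal, so they are sampled more
    often. This is NOT the released RLCS implementation.
    """

    def __init__(self, tasks, window=200, floor=0.05):
        self.tasks = list(tasks)
        self.floor = floor  # minimum weight so no task is starved
        # Keep a sliding window of recent outcomes per task.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, task, solved):
        """Log whether a rollout for `task` was rewarded as correct."""
        self.history[task].append(1.0 if solved else 0.0)

    def _success_rate(self, task):
        h = self.history[task]
        return sum(h) / len(h) if h else 0.5  # optimistic prior

    def sample(self):
        """Pick the next task, favoring intermediate success rates."""
        weights = []
        for t in self.tasks:
            p = self._success_rate(t)
            # Weight peaks at p = 0.5; saturated or unsolved tasks get less.
            weights.append(self.floor + p * (1.0 - p))
        return random.choices(self.tasks, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Example: schedule rollouts across the task families named in the paper.
    sampler = CurriculumSampler(
        ["stem", "video", "grounding", "gui_agent", "long_doc", "coding"]
    )
    for _ in range(1000):
        task = sampler.sample()
        solved = random.random() < 0.4  # stand-in for a verifier reward
        sampler.record(task, solved)
```

In a real training loop, the `solved` signal would come from reward verification of model rollouts, and the sampled task would determine which prompts enter the next policy-optimization batch; both are stand-ins in this sketch.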
Problem

Research questions and friction points this paper is trying to address.

Advancing general-purpose multimodal reasoning with a vision-language model
Enhancing diverse task performance via scalable reinforcement learning
Achieving state-of-the-art results in STEM, video, and document understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale pre-trained vision foundation model
Reinforcement Learning with Curriculum Sampling
State-of-the-art multimodal reasoning performance
Authors

Wenyi Hong (Tsinghua University): multimodal pretraining
Wenmeng Yu (Tsinghua University): Natural Language Processing, Multimodal Learning, Facial Expression Recognition
Xiaotao Gu (Zhipu AI): Language Modeling, Generative Models, Data Mining
Guo Wang
Guobing Gan
Haomiao Tang
Jiale Cheng
Ji Qi
Junhui Ji
Lihang Pan
Shuaiqi Duan
Weihan Wang
Yan Wang
Yean Cheng
Zehai He
Zhe Su
Zhen Yang
Ziyang Pan
Aohan Zeng (Tsinghua University): Large Language Models, Natural Language Processing
Baoxu Wang
Boyan Shi
Changyu Pang
Chenhui Zhang
Da Yin (Meta FAIR): Natural Language Processing
Fan Yang