Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation

📅 2025-10-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses inference efficiency and cross-scale consistency in large language models. The authors propose Ling 2.0, a general-purpose, sparsely activated reasoning LLM series that scales seamlessly from tens of billions to one trillion parameters. To this end, they design a highly sparse Mixture-of-Experts (MoE) architecture with Multi-Token Prediction (MTP), alongside cross-scale consistent modeling, reasoning-oriented data construction, Chain-of-Thought (CoT)-activating mid-training, reinforcement-based fine-tuning (DFT and Evolutionary CoT, Evo-CoT), full FP8 training, and fine-grained heterogeneous pipelining. At the trillion-parameter scale, Ling-1T establishes a new Pareto frontier of reasoning accuracy versus computational efficiency: it reduces activated FLOPs by up to 86% (roughly 7x active-compute efficiency) compared with dense baselines, significantly advancing the engineering deployment of efficient, scalable reasoning systems.
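The efficiency claim above rests on sparse activation: a router sends each token to only a few experts, so the FLOPs actually executed are a small fraction of what the total parameter count would suggest. The PyTorch sketch below shows a generic top-k routed MoE layer; the expert count, top-k, and dimensions are illustrative assumptions, not Ling 2.0's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Generic top-k routed MoE feed-forward layer (illustrative sizes, not Ling 2.0's)."""

    def __init__(self, d_model=1024, d_ff=2048, n_experts=64, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (tokens, d_model)
        gate_logits = self.router(x)               # (tokens, n_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over the selected experts
        out = torch.zeros_like(x)
        # Only the selected experts run for each token; unselected experts cost no FLOPs.
        for slot in range(self.top_k):
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

# With 4 of 64 experts active per token, expert FLOPs are ~1/16 of a dense layer holding
# the same total expert parameters; the 86% activated-FLOPs reduction quoted above
# corresponds to executing roughly 1/7 of a dense baseline's compute.
layer = SparseMoELayer()
print(layer(torch.randn(8, 1024)).shape)           # torch.Size([8, 1024])
```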

📝 Abstract
We introduce Ling 2.0, a series of reasoning-oriented language foundation models built upon the principle that every activation boosts reasoning capability. Designed to scale from tens of billions to one trillion parameters under a unified Mixture-of-Experts (MoE) paradigm, Ling 2.0 emphasizes high sparsity, cross-scale consistency, and efficiency guided by empirical scaling laws. The series includes three non-thinking (instruct) models - Ling-mini-2.0, Ling-flash-2.0, and Ling-1T - ranging from 16B to 1T total parameters and achieving up to 7-fold active-compute efficiency compared with dense counterparts. Ling 2.0 integrates coordinated innovations across model architecture, pre-training, post-training, and infrastructure: a high-sparsity MoE with MTP for efficient reasoning, reasoning-oriented data and mid-training CoT activation, reinforcement-based fine-tuning (DFT, Evo-CoT), and full-scale FP8 training with fine-grained heterogeneous pipelines. At the trillion scale, Ling-1T establishes a new Pareto frontier of reasoning accuracy versus computational efficiency, demonstrating that sparse activation, when properly aligned with reasoning objectives, enables scalable and efficient intelligence. Collectively, Ling 2.0 provides a coherent, open, and efficient foundation for advancing future reasoning and thinking models, including the Ring series built upon the same base.
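The abstract pairs the high-sparsity MoE with MTP (multi-token prediction), in which auxiliary heads predict tokens beyond the immediate next one, densifying the training signal and enabling draft-and-verify decoding. Below is a minimal sketch of that general idea; the head layout, loss weighting, and sizes are assumptions for illustration, not the Ling 2.0 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHead(nn.Module):
    """Auxiliary head predicting token t+2 from the hidden state at position t.
    Illustrative only; sizes and structure are assumed, not taken from Ling 2.0."""

    def __init__(self, d_model=1024, vocab_size=32000):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, hidden):                       # hidden: (batch, seq, d_model)
        return self.out(torch.tanh(self.proj(hidden)))

def mtp_loss(main_logits, mtp_logits, targets, mtp_weight=0.3):
    """Combine the usual next-token loss with a down-weighted token-after-next loss."""
    vocab = main_logits.size(-1)
    # Next-token prediction: position t predicts targets[t + 1].
    next_tok = F.cross_entropy(
        main_logits[:, :-1].reshape(-1, vocab),
        targets[:, 1:].reshape(-1),
    )
    # Multi-token prediction: position t also predicts targets[t + 2].
    plus_two = F.cross_entropy(
        mtp_logits[:, :-2].reshape(-1, vocab),
        targets[:, 2:].reshape(-1),
    )
    return next_tok + mtp_weight * plus_two
```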
Problem

Research questions and friction points this paper is trying to address.

Scaling sparse MoE language models to the trillion-parameter scale for reasoning
Achieving computational efficiency through high sparsity and activation optimization
Establishing a new Pareto frontier of reasoning accuracy versus computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-sparsity Mixture-of-Experts with MTP for efficient reasoning
Reasoning-oriented data and mid-training CoT activation
Full-scale FP8 training with fine-grained heterogeneous pipelines
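On the last point, FP8 training: the core mechanic is storing and multiplying tensors in an 8-bit floating-point format, with per-tensor scales that map each tensor's dynamic range onto FP8's narrow representable range. The sketch below shows only the scale-and-cast round trip using PyTorch's torch.float8_e4m3fn dtype (available in recent PyTorch releases); a real training stack would run scaled FP8 GEMM kernels (e.g., via NVIDIA's Transformer Engine), and nothing here reflects Ling 2.0's actual recipe.

```python
import torch

FP8 = torch.float8_e4m3fn                 # 8-bit float storage dtype (PyTorch >= 2.1)
FP8_MAX = torch.finfo(FP8).max            # 448.0 for the e4m3 format

def quantize_fp8(x: torch.Tensor):
    """Per-tensor scaling: map the tensor's max magnitude onto the FP8 range."""
    scale = FP8_MAX / x.abs().max().clamp(min=1e-12)
    return (x * scale).to(FP8), scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor):
    """Recover an approximate high-precision tensor from FP8 storage."""
    return x_fp8.to(torch.float32) / scale

w = torch.randn(512, 512)
w_fp8, s = quantize_fp8(w)
w_hat = dequantize_fp8(w_fp8, s)
print((w - w_hat).abs().max())            # small round-trip error from the 3-bit mantissa
```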
Authors

Ling Team (Inclusion AI)
Ang Li (Inclusion AI)
Ben Liu (Inclusion AI)
Binbin Hu (BUPT & Ant Group)
Bing Li (Inclusion AI)
Bingwei Zeng (Inclusion AI)
Borui Ye (Inclusion AI)
Caizhi Tang (Inclusion AI)
Changxin Tian (Renmin University of China & Ant Group)
Chao Huang (Inclusion AI)
Chao Zhang (Inclusion AI)
Chen Qian (Inclusion AI)
Chenchen Ju (Inclusion AI)
Chenchen Li (Inclusion AI)
Chengfu Tang (Inclusion AI)
Chili Fu (Inclusion AI)
Chunshao Ren (Inclusion AI)
Chunwei Wu (Inclusion AI)
Cong Zhang (Inclusion AI)
Cunyin Peng (Inclusion AI)
Dafeng Xu (Inclusion AI)
Daixin Wang (Tsinghua University)
Dalong Zhang (Inclusion AI)
Dingnan Jin (Inclusion AI)
Dingyuan Zhu (Inclusion AI)