Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation

📅 2025-10-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses inference efficiency and cross-scale consistency in large language models. The authors propose Ling 2.0, a general-purpose, sparsely activated reasoning LLM series that scales seamlessly from tens of billions to one trillion parameters. To this end, they design a highly sparse Mixture-of-Experts (MoE) architecture with Multi-Token Prediction (MTP), alongside cross-scale consistent modeling, reasoning-oriented data construction, Chain-of-Thought (CoT)-activating mid-training, reinforcement-based fine-tuning (DFT and Evolutionary CoT, Evo-CoT), full FP8 training, and fine-grained heterogeneous pipelining. At the trillion-parameter scale, Ling-1T establishes a new Pareto frontier of reasoning accuracy versus computational efficiency: it reduces activated FLOPs by up to 86% (roughly 7x active-compute efficiency) compared with dense baselines, significantly advancing the engineering deployment of efficient, scalable reasoning systems.
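The efficiency claim above rests on sparse activation: a router sends each token to only a few experts, so the FLOPs actually executed are a small fraction of what the total parameter count would suggest. The PyTorch sketch below shows a generic top-k routed MoE layer; the expert count, top-k, and dimensions are illustrative assumptions, not Ling 2.0's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Generic top-k routed MoE feed-forward layer (illustrative sizes, not Ling 2.0's)."""

    def __init__(self, d_model=1024, d_ff=2048, n_experts=64, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (tokens, d_model)
        gate_logits = self.router(x)               # (tokens, n_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over the selected experts
        out = torch.zeros_like(x)
        # Only the selected experts run for each token; unselected experts cost no FLOPs.
        for slot in range(self.top_k):
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

# With 4 of 64 experts active per token, expert FLOPs are ~1/16 of a dense layer holding
# the same total expert parameters; the 86% activated-FLOPs reduction quoted above
# corresponds to executing roughly 1/7 of a dense baseline's compute.
layer = SparseMoELayer()
print(layer(torch.randn(8, 1024)).shape)           # torch.Size([8, 1024])
```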

📝 Abstract
We introduce Ling 2.0, a series of reasoning-oriented language foundation models built upon the principle that every activation boosts reasoning capability. Designed to scale from tens of billions to one trillion parameters under a unified Mixture-of-Experts (MoE) paradigm, Ling 2.0 emphasizes high sparsity, cross-scale consistency, and efficiency guided by empirical scaling laws. The series includes three non-thinking (instruct) models - Ling-mini-2.0, Ling-flash-2.0, and Ling-1T - ranging from 16B to 1T total parameters and achieving up to 7-fold active-compute efficiency compared with dense counterparts. Ling 2.0 integrates coordinated innovations across model architecture, pre-training, post-training, and infrastructure: a high-sparsity MoE with MTP for efficient reasoning, reasoning-oriented data and mid-training CoT activation, reinforcement-based fine-tuning (DFT, Evo-CoT), and full-scale FP8 training with fine-grained heterogeneous pipelines. At the trillion scale, Ling-1T establishes a new Pareto frontier of reasoning accuracy versus computational efficiency, demonstrating that sparse activation, when properly aligned with reasoning objectives, enables scalable and efficient intelligence. Collectively, Ling 2.0 provides a coherent, open, and efficient foundation for advancing future reasoning and thinking models, including the Ring series built upon the same base.
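The abstract pairs the high-sparsity MoE with MTP (multi-token prediction), in which auxiliary heads predict tokens beyond the immediate next one, densifying the training signal and enabling draft-and-verify decoding. Below is a minimal sketch of that general idea; the head layout, loss weighting, and sizes are assumptions for illustration, not the Ling 2.0 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHead(nn.Module):
    """Auxiliary head predicting token t+2 from the hidden state at position t.
    Illustrative only; sizes and structure are assumed, not taken from Ling 2.0."""

    def __init__(self, d_model=1024, vocab_size=32000):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, hidden):                       # hidden: (batch, seq, d_model)
        return self.out(torch.tanh(self.proj(hidden)))

def mtp_loss(main_logits, mtp_logits, targets, mtp_weight=0.3):
    """Combine the usual next-token loss with a down-weighted token-after-next loss."""
    vocab = main_logits.size(-1)
    # Next-token prediction: position t predicts targets[t + 1].
    next_tok = F.cross_entropy(
        main_logits[:, :-1].reshape(-1, vocab),
        targets[:, 1:].reshape(-1),
    )
    # Multi-token prediction: position t also predicts targets[t + 2].
    plus_two = F.cross_entropy(
        mtp_logits[:, :-2].reshape(-1, vocab),
        targets[:, 2:].reshape(-1),
    )
    return next_tok + mtp_weight * plus_two
```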
Problem

Research questions and friction points this paper is trying to address.

Scaling sparse MoE language models to the trillion-parameter scale for reasoning
Achieving computational efficiency through high sparsity and activation optimization
Establishing a new Pareto frontier of reasoning accuracy versus computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-sparsity Mixture-of-Experts with MTP for efficient reasoning
Reasoning-oriented data and mid-training CoT activation
Full-scale FP8 training with fine-grained heterogeneous pipelines
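On the last point, FP8 training: the core mechanic is storing and multiplying tensors in an 8-bit floating-point format, with per-tensor scales that map each tensor's dynamic range onto FP8's narrow representable range. The sketch below shows only the scale-and-cast round trip using PyTorch's torch.float8_e4m3fn dtype (available in recent PyTorch releases); a real training stack would run scaled FP8 GEMM kernels (e.g., via NVIDIA's Transformer Engine), and nothing here reflects Ling 2.0's actual recipe.

```python
import torch

FP8 = torch.float8_e4m3fn                 # 8-bit float storage dtype (PyTorch >= 2.1)
FP8_MAX = torch.finfo(FP8).max            # 448.0 for the e4m3 format

def quantize_fp8(x: torch.Tensor):
    """Per-tensor scaling: map the tensor's max magnitude onto the FP8 range."""
    scale = FP8_MAX / x.abs().max().clamp(min=1e-12)
    return (x * scale).to(FP8), scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor):
    """Recover an approximate high-precision tensor from FP8 storage."""
    return x_fp8.to(torch.float32) / scale

w = torch.randn(512, 512)
w_fp8, s = quantize_fp8(w)
w_hat = dequantize_fp8(w_fp8, s)
print((w - w_hat).abs().max())            # small round-trip error from the 3-bit mantissa
```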
Authors

Ling Team (Inclusion AI)
Ang Li (Inclusion AI)
Ben Liu (Inclusion AI)
Binbin Hu (BUPT & Ant Group)
Bing Li (Inclusion AI)
Bingwei Zeng (Inclusion AI)
Borui Ye (Inclusion AI)
Caizhi Tang (Inclusion AI)
Changxin Tian (Renmin University of China & Ant Group)
Chao Huang (Inclusion AI)
Chao Zhang (Inclusion AI)
Chen Qian (Inclusion AI)
Chenchen Ju (Inclusion AI)
Chenchen Li (Inclusion AI)
Chengfu Tang (Inclusion AI)
Chili Fu (Inclusion AI)
Chunshao Ren (Inclusion AI)
Chunwei Wu (Inclusion AI)
Cong Zhang (Inclusion AI)
Cunyin Peng (Inclusion AI)
Dafeng Xu (Inclusion AI)
Daixin Wang (Tsinghua University)
Dalong Zhang (Inclusion AI)
Dingnan Jin (Inclusion AI)
Dingyuan Zhu (Inclusion AI)