JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

📅 2026-04-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of balancing performance and token efficiency in large language models with fewer than 50 billion parameters by introducing an efficient sparse mixture-of-experts (MoE) architecture. The proposed model activates only 2.7 billion of its 48 billion total parameters per forward pass, and integrates multi-token prediction (MTP) with quantization-aware training (QAT) through a co-design strategy. The post-training pipeline combines supervised fine-tuning (SFT), direct preference optimization (DPO), and a newly introduced reinforcement learning algorithm, FiberPO, which enables stable multi-scale policy optimization for balancing cognitive modes. The resulting model achieves state-of-the-art performance at a low activated-parameter count, significantly enhancing inference throughput and token efficiency. Both base and post-trained versions of the model are publicly released.
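The headline sparsity figure can be checked with simple arithmetic: activating 2.7B of 48B parameters means only a small fraction of the weights participate in each forward pass. A minimal sketch (the parameter counts come from the summary; the per-token FLOP framing is an illustrative assumption):

```python
# Back-of-envelope check of the sparsity ratio reported above:
# 2.7B parameters activated out of 48B total per forward pass.
total_params_b = 48.0    # total parameters, in billions
active_params_b = 2.7    # activated parameters per token, in billions

activation_ratio = active_params_b / total_params_b
print(f"Activated fraction per token: {activation_ratio:.2%}")
# Roughly, per-token compute scales with activated (not total) parameters,
# which is why such a sparse MoE can serve far faster than a 48B dense model.
```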
📝 Abstract
We introduce JoyAI-LLM Flash, an efficient Mixture-of-Experts (MoE) language model designed to redefine the trade-off between strong performance and token efficiency in the sub-50B parameter regime. JoyAI-LLM Flash is pretrained on a massive corpus of 20 trillion tokens and further optimized through a rigorous post-training pipeline, including supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and large-scale reinforcement learning (RL) across diverse environments. To improve token efficiency, JoyAI-LLM Flash strategically balances *thinking* and *non-thinking* cognitive modes and introduces FiberPO, a novel RL algorithm inspired by fibration theory that decomposes trust-region maintenance into global and local components, providing unified multi-scale stability control for LLM policy optimization. To enhance architectural sparsity, the model comprises 48B total parameters while activating only 2.7B parameters per forward pass, achieving a substantially higher sparsity ratio than contemporary industry-leading models of comparable scale. To further improve inference throughput, we adopt a joint training-inference co-design that incorporates dense Multi-Token Prediction (MTP) and Quantization-Aware Training (QAT). We release the checkpoints for both JoyAI-LLM-48B-A3B Base and its post-trained variants on Hugging Face to support the open-source community.
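The paper does not disclose its router design, but sparse MoE layers of this kind typically select a small top-k subset of experts per token and renormalize the gate weights over that subset. A generic top-k routing sketch (the expert count and k are hypothetical, not taken from the paper):

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(router_logits, k=2):
    """Pick the top-k experts for one token and renormalize their gate weights.

    Returns a list of (expert_index, gate_weight) pairs; only these k experts'
    parameters are activated for this token, which is the source of sparsity.
    """
    topk = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in topk])
    return list(zip(topk, weights))

# Illustrative usage with a hypothetical 64-expert layer.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(64)]
for expert, weight in route_token(logits, k=2):
    print(f"expert {expert}: gate weight {weight:.3f}")
```

With k fixed, per-token compute stays constant as the total expert count grows, which is how a 48B-total model can keep only ~2.7B parameters active per forward pass.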
Problem

Research questions and friction points this paper is trying to address.

token efficiency
Mixture-of-Experts
large language models
inference throughput
model sparsity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Token Efficiency
FiberPO
Quantization-Aware Training
Multi-Token Prediction
👥 Authors

Aichen Cai, Anmeng Zhang, Anyu Li, Bo Zhang, Bohua Cai, Chang Li, Changjian Jiang, Changkai Lu, Chao Xue (Beihang University), Chaocai Liang, Cheng Zhang, Dongkai Liu, Fei Wang, Guoqiang Huang, Haijian Ke, Han Lin, Hao Wang, Ji Miao, Jiacheng Zhang, Jialong Shi, Jifeng Zhu, Jingjing Qian, Junhui Luo, Junwu Xiong, Lam So