JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

📅 2026-04-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of balancing performance and token efficiency in large language models with fewer than 50 billion parameters by introducing an efficient sparse mixture-of-experts (MoE) architecture. The proposed model activates only 2.7 billion of its 48 billion total parameters per forward pass, and integrates multi-token prediction (MTP) with quantization-aware training (QAT) through a co-design strategy. The post-training pipeline combines supervised fine-tuning (SFT), direct preference optimization (DPO), and a newly introduced reinforcement learning algorithm, FiberPO, which enables stable multi-scale policy optimization for balancing cognitive modes. The resulting model achieves state-of-the-art performance at a low activated-parameter count, significantly enhancing inference throughput and token efficiency. Both base and post-trained versions of the model are publicly released.
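The headline sparsity figure can be checked with simple arithmetic: activating 2.7B of 48B parameters means only a small fraction of the weights participate in each forward pass. A minimal sketch (the parameter counts come from the summary; the per-token FLOP framing is an illustrative assumption):

```python
# Back-of-envelope check of the sparsity ratio reported above:
# 2.7B parameters activated out of 48B total per forward pass.
total_params_b = 48.0    # total parameters, in billions
active_params_b = 2.7    # activated parameters per token, in billions

activation_ratio = active_params_b / total_params_b
print(f"Activated fraction per token: {activation_ratio:.2%}")
# Roughly, per-token compute scales with activated (not total) parameters,
# which is why such a sparse MoE can serve far faster than a 48B dense model.
```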
📝 Abstract
We introduce JoyAI-LLM Flash, an efficient Mixture-of-Experts (MoE) language model designed to redefine the trade-off between strong performance and token efficiency in the sub-50B parameter regime. JoyAI-LLM Flash is pretrained on a massive corpus of 20 trillion tokens and further optimized through a rigorous post-training pipeline, including supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and large-scale reinforcement learning (RL) across diverse environments. To improve token efficiency, JoyAI-LLM Flash strategically balances *thinking* and *non-thinking* cognitive modes and introduces FiberPO, a novel RL algorithm inspired by fibration theory that decomposes trust-region maintenance into global and local components, providing unified multi-scale stability control for LLM policy optimization. To enhance architectural sparsity, the model comprises 48B total parameters while activating only 2.7B parameters per forward pass, achieving a substantially higher sparsity ratio than contemporary industry-leading models of comparable scale. To further improve inference throughput, we adopt a joint training-inference co-design that incorporates dense Multi-Token Prediction (MTP) and Quantization-Aware Training (QAT). We release the checkpoints for both JoyAI-LLM-48B-A3B Base and its post-trained variants on Hugging Face to support the open-source community.
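The paper does not disclose its router design, but sparse MoE layers of this kind typically select a small top-k subset of experts per token and renormalize the gate weights over that subset. A generic top-k routing sketch (the expert count and k are hypothetical, not taken from the paper):

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(router_logits, k=2):
    """Pick the top-k experts for one token and renormalize their gate weights.

    Returns a list of (expert_index, gate_weight) pairs; only these k experts'
    parameters are activated for this token, which is the source of sparsity.
    """
    topk = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in topk])
    return list(zip(topk, weights))

# Illustrative usage with a hypothetical 64-expert layer.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(64)]
for expert, weight in route_token(logits, k=2):
    print(f"expert {expert}: gate weight {weight:.3f}")
```

With k fixed, per-token compute stays constant as the total expert count grows, which is how a 48B-total model can keep only ~2.7B parameters active per forward pass.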
Problem

Research questions and friction points this paper is trying to address.

token efficiency
Mixture-of-Experts
large language models
inference throughput
model sparsity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Token Efficiency
FiberPO
Quantization-Aware Training
Multi-Token Prediction
👥 Authors

Aichen Cai, Anmeng Zhang, Anyu Li, Bo Zhang, Bohua Cai, Chang Li, Changjian Jiang, Changkai Lu, Chao Xue (Beihang University), Chaocai Liang, Cheng Zhang, Dongkai Liu, Fei Wang, Guoqiang Huang, Haijian Ke, Han Lin, Hao Wang, Ji Miao, Jiacheng Zhang, Jialong Shi, Jifeng Zhu, Jingjing Qian, Junhui Luo, Junwu Xiong, Lam So