🤖 AI Summary
This work addresses the challenge of balancing performance and token efficiency in large language models with fewer than 50 billion parameters by introducing an efficient sparse mixture-of-experts (MoE) architecture. The proposed model activates only 2.7 billion of its 48 billion total parameters per forward pass and integrates multi-token prediction (MTP) with quantization-aware training (QAT) through a training-inference co-design strategy. Leveraging supervised fine-tuning (SFT), direct preference optimization (DPO), and a newly introduced reinforcement learning algorithm, FiberPO, which enables stable multi-scale policy optimization to balance cognitive modes, the model achieves state-of-the-art performance with a low activated-parameter count. This approach substantially improves both inference throughput and token efficiency. Both the base and post-trained versions of the model are publicly released.
📝 Abstract
We introduce JoyAI-LLM Flash, an efficient Mixture-of-Experts (MoE) language model designed to redefine the trade-off between strong performance and token efficiency in the sub-50B-parameter regime. JoyAI-LLM Flash is pretrained on a massive corpus of 20 trillion tokens and further optimized through a rigorous post-training pipeline, including supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and large-scale reinforcement learning (RL) across diverse environments. To improve token efficiency, JoyAI-LLM Flash strategically balances *thinking* and *non-thinking* cognitive modes and introduces FiberPO, a novel RL algorithm inspired by fibration theory that decomposes trust-region maintenance into global and local components, providing unified multi-scale stability control for LLM policy optimization. To enhance architectural sparsity, the model comprises 48B total parameters while activating only 2.7B per forward pass, a substantially higher sparsity ratio than contemporary industry-leading models of comparable scale. To further improve inference throughput, we adopt a joint training-inference co-design that incorporates dense Multi-Token Prediction (MTP) and Quantization-Aware Training (QAT). We release checkpoints for both the JoyAI-LLM-48B-A3B Base model and its post-trained variants on Hugging Face to support the open-source community.
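The headline sparsity claim, activating 2.7B of 48B parameters (roughly 5.6%) per forward pass, follows the standard top-k MoE routing pattern: a gate scores all experts per token, but only the k highest-scoring experts actually run. The sketch below is a minimal illustration of that general pattern with toy linear experts; all names and sizes are illustrative assumptions, not JoyAI-LLM Flash internals.

```python
import numpy as np

def topk_moe_forward(x, gate_w, experts, k=2):
    """Toy top-k MoE layer (illustrative, not the paper's architecture).

    Each token is routed to its k highest-scoring experts; their outputs
    are combined with softmax weights renormalized over the selected k.
    """
    logits = x @ gate_w                           # (tokens, num_experts) gate scores
    topk = np.argsort(logits, axis=-1)[:, -k:]    # indices of the k best experts per token
    out = np.zeros_like(x)
    active = set()                                # experts that actually ran
    for t in range(x.shape[0]):
        sel = logits[t, topk[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                              # softmax over the selected experts only
        for wi, e in zip(w, topk[t]):
            out[t] += wi * experts[e](x[t])       # only k of the experts execute
            active.add(int(e))
    return out, active

# Usage: 8 toy linear experts, but only 2 run per token.
rng = np.random.default_rng(0)
d, num_experts = 4, 8
experts = [(lambda W: (lambda v: v @ W))(rng.standard_normal((d, d)))
           for _ in range(num_experts)]
gate_w = rng.standard_normal((d, num_experts))
x = rng.standard_normal((3, d))                   # 3 tokens
y, active = topk_moe_forward(x, gate_w, experts, k=2)
```

With 3 tokens and k=2, at most 6 of the 8 experts execute, so most expert parameters stay idle on any given forward pass; scaled up, this is how a 48B-parameter model can touch only 2.7B parameters per token.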