🤖 AI Summary
Large language models (LLMs) suffer from hallucination and low accuracy on domain-specific tasks, while conventional continual pretraining often degrades general-purpose capabilities. To address this trade-off, we propose the Mixture-of-Losses (MoL) framework, which decouples optimization objectives for domain-specific and general corpora: cross-entropy loss is applied to domain data to strengthen domain expertise, while KL divergence regularization is imposed on general-domain data to preserve the original model’s output distribution. We introduce a novel dual-loss co-updating mechanism and empirically find that a 1:1 ratio of domain-to-general data achieves optimal balance—without requiring hyperparameter tuning. On Math-500 (non-chain-of-thought), accuracy improves by 27.9%; on AIME25 (chain-of-thought), it rises by 83.3%. Crucially, no degradation in general capabilities is observed—significantly outperforming standard continual pretraining approaches.
📝 Abstract
Although LLMs perform well on general tasks, domain-specific applications suffer from hallucinations and accuracy limitations. Continual pretraining (CPT) approaches encounter two key issues: (1) domain-biased data degrades general language skills, and (2) improper corpus-mixture ratios limit effective adaptation. To address these, we propose a novel framework, Mixture of Losses (MoL), which decouples the optimization objectives for domain-specific and general corpora. Specifically, cross-entropy (CE) loss is applied to domain data to ensure knowledge acquisition, while Kullback-Leibler (KL) divergence aligns general-corpus training with the base model's foundational capabilities. This dual-loss architecture preserves universal skills while enhancing domain expertise, avoiding catastrophic forgetting. Empirically, we validate that a 1:1 domain-to-general corpus ratio optimally balances training effectiveness against overfitting without the need for extensive tuning or resource-intensive experiments. Furthermore, our experiments demonstrate significant performance gains over traditional CPT approaches, which often suffer from degradation in general language capabilities: our model achieves 27.9% higher accuracy on the Math-500 benchmark in the non-think reasoning mode and an 83.3% improvement on the challenging AIME25 subset in the think mode, underscoring the effectiveness of our approach.
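The dual-loss idea can be sketched numerically: CE loss on domain tokens, plus a KL term that anchors the fine-tuned model's output distribution on general-domain tokens to the frozen base model's. The following is a minimal illustrative sketch, not the authors' implementation; all function names, shapes, and the unweighted 1:1 combination are assumptions for demonstration.

```python
# Illustrative sketch of the MoL dual-loss objective (not the paper's code).
# Domain tokens get cross-entropy; general tokens get KL(base || current),
# which penalizes drift away from the base model's output distribution.
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary (last) axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def ce_loss(logits, targets):
    # Mean negative log-likelihood of the target token ids:
    # strengthens domain knowledge on the domain corpus.
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets].mean()

def kl_loss(base_logits, cur_logits):
    # Mean KL(base || current) per token position: keeps the
    # fine-tuned model close to the base model on general data.
    logp_base = log_softmax(base_logits)
    logp_cur = log_softmax(cur_logits)
    return (np.exp(logp_base) * (logp_base - logp_cur)).sum(axis=-1).mean()

def mol_loss(domain_logits, domain_targets, base_logits, cur_logits):
    # Unweighted sum over a 1:1 domain-to-general token mix,
    # mirroring the ratio the paper reports as optimal.
    return ce_loss(domain_logits, domain_targets) + kl_loss(base_logits, cur_logits)
```

In an actual training loop the two terms would be computed on separate mini-batches (domain vs. general corpus) and the base-model logits would come from a frozen copy of the pretrained checkpoint, evaluated without gradients.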