MoL for LLMs: Dual-Loss Optimization to Enhance Domain Expertise While Preserving General Capabilities

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) suffer from hallucination and low accuracy on domain-specific tasks, while conventional continual pretraining often degrades general-purpose capabilities. To address this trade-off, we propose the Mixture-of-Losses (MoL) framework, which decouples optimization objectives for domain-specific and general corpora: cross-entropy loss is applied to domain data to strengthen domain expertise, while KL divergence regularization is imposed on general-domain data to preserve the original model’s output distribution. We introduce a novel dual-loss co-updating mechanism and empirically find that a 1:1 ratio of domain-to-general data achieves optimal balance—without requiring hyperparameter tuning. On Math-500 (non-chain-of-thought), accuracy improves by 27.9%; on AIME25 (chain-of-thought), it rises by 83.3%. Crucially, no degradation in general capabilities is observed—significantly outperforming standard continual pretraining approaches.
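Read as described, the decoupled objective combines a cross-entropy term on the domain corpus with a KL regularizer on the general corpus. A plausible form of the combined loss is sketched below; the notation, KL direction, and equal weighting are assumptions for illustration, not taken from the paper.

```latex
% Sketch of the dual-loss objective described above (assumed notation):
% \theta is the model being adapted, \theta_0 the frozen base model.
\mathcal{L}_{\mathrm{MoL}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{domain}}}\!\left[\mathcal{L}_{\mathrm{CE}}(x;\theta)\right]
  + \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{general}}}\!\left[
      \mathrm{KL}\!\left(p_{\theta_0}(\cdot \mid x)\,\middle\|\,p_{\theta}(\cdot \mid x)\right)\right]
```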

📝 Abstract
Although LLMs perform well in general tasks, domain-specific applications suffer from hallucinations and accuracy limitations. Continual pre-training (CPT) approaches encounter two key issues: (1) domain-biased data degrades general language skills, and (2) improper corpus-mixture ratios limit effective adaptation. To address these, we propose a novel framework, Mixture of Losses (MoL), which decouples optimization objectives for domain-specific and general corpora. Specifically, cross-entropy (CE) loss is applied to domain data to ensure knowledge acquisition, while Kullback-Leibler (KL) divergence aligns general-corpus training with the base model's foundational capabilities. This dual-loss architecture preserves universal skills while enhancing domain expertise, avoiding catastrophic forgetting. Empirically, we validate that a 1:1 domain-to-general corpus ratio optimally balances training and overfitting without the need for extensive tuning or resource-intensive experiments. Furthermore, our experiments demonstrate significant performance gains compared to traditional CPT approaches, which often suffer from degradation in general language capabilities; our model achieves 27.9% higher accuracy on the Math-500 benchmark in the non-think reasoning mode, and an impressive 83.3% improvement on the challenging AIME25 subset in the think mode, underscoring the effectiveness of our approach.
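As a concrete illustration of the dual-loss idea, the sketch below computes one MoL-style training loss in PyTorch. It assumes Hugging Face-style causal LM objects (a trainable `student` and a frozen `reference` copy of the base checkpoint) and batches already tokenized into `input_ids`; all names, the KL direction, and the equal 1:1 weighting are assumptions made for this sketch, not the paper's exact implementation.

```python
# Minimal sketch of a MoL-style dual loss: cross-entropy on domain data,
# KL-to-base-model regularization on general data (assumed formulation).
import torch
import torch.nn.functional as F

def mol_loss(student, reference, domain_batch, general_batch):
    """Return CE(domain) + KL(general), combined with equal weight."""
    # Domain corpus: standard next-token cross-entropy against gold tokens.
    d_logits = student(domain_batch["input_ids"]).logits
    ce = F.cross_entropy(
        d_logits[:, :-1].reshape(-1, d_logits.size(-1)),
        domain_batch["input_ids"][:, 1:].reshape(-1),
    )

    # General corpus: pull the student's distribution toward the frozen
    # base model's distribution (KL direction is an assumption here).
    g_logits = student(general_batch["input_ids"]).logits
    with torch.no_grad():
        ref_logits = reference(general_batch["input_ids"]).logits
    kl = F.kl_div(
        F.log_softmax(g_logits, dim=-1),
        F.log_softmax(ref_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )

    # 1:1 corpus ratio from the abstract, mirrored as equal loss weights.
    return ce + kl
```

In practice, `reference` would simply be a second copy of the pre-adaptation checkpoint with gradients disabled, so the KL term anchors the general-domain behavior while the CE term drives domain learning.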
Problem

Research questions and friction points this paper is trying to address.

Enhance domain expertise in LLMs while preserving general capabilities
Address domain-biased data degrading general language skills
Optimize corpus-mixture ratios for effective domain adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-loss optimization for domain and general skills
Cross-entropy loss ensures domain knowledge acquisition
KL divergence preserves general language capabilities
Jingxue Chen
Wired Product Operation Division, ZTE Corporation, Nanjing, China
Qingkun Tang
Wired Product Operation Division, ZTE Corporation, Nanjing, China
Qianchun Lu
Wired Product Operation Division, ZTE Corporation, Nanjing, China
Siyuan Fang
Beijing University of Posts and Telecommunications
artificial intelligence