🤖 AI Summary
Large language models (LLMs) suffer from hallucination and low accuracy on domain-specific tasks, while conventional continual pretraining often degrades general-purpose capabilities. To address this trade-off, we propose the Mixture-of-Losses (MoL) framework, which decouples optimization objectives for domain-specific and general corpora: cross-entropy loss is applied to domain data to strengthen domain expertise, while KL divergence regularization is imposed on general-domain data to preserve the original model’s output distribution. We introduce a novel dual-loss co-updating mechanism and empirically find that a 1:1 ratio of domain-to-general data achieves optimal balance—without requiring hyperparameter tuning. On Math-500 (non-chain-of-thought), accuracy improves by 27.9%; on AIME25 (chain-of-thought), it rises by 83.3%. Crucially, no degradation in general capabilities is observed—significantly outperforming standard continual pretraining approaches.
📝 Abstract
Although LLMs perform well on general tasks, domain-specific applications suffer from hallucinations and accuracy limitations. Continual pretraining (CPT) approaches encounter two key issues: (1) domain-biased data degrades general language skills, and (2) improper corpus-mixture ratios limit effective adaptation. To address these, we propose a novel framework, Mixture of Losses (MoL), which decouples the optimization objectives for domain-specific and general corpora. Specifically, cross-entropy (CE) loss is applied to domain data to ensure knowledge acquisition, while Kullback-Leibler (KL) divergence aligns general-corpus training with the base model's foundational capabilities. This dual-loss architecture preserves universal skills while enhancing domain expertise, avoiding catastrophic forgetting. Empirically, we validate that a 1:1 domain-to-general corpus ratio optimally balances training effectiveness against overfitting without the need for extensive tuning or resource-intensive experiments. Furthermore, our experiments demonstrate significant performance gains over traditional CPT approaches, which often suffer from degradation in general language capabilities: our model achieves 27.9% higher accuracy on the Math-500 benchmark in the non-think reasoning mode and an 83.3% improvement on the challenging AIME25 subset in the think mode, underscoring the effectiveness of our approach.
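The dual-loss idea can be sketched numerically: CE loss on domain tokens, plus a KL term that anchors the fine-tuned model's output distribution on general-domain tokens to the frozen base model's. The following is a minimal illustrative sketch, not the authors' implementation; all function names, shapes, and the unweighted 1:1 combination are assumptions for demonstration.

```python
# Illustrative sketch of the MoL dual-loss objective (not the paper's code).
# Domain tokens get cross-entropy; general tokens get KL(base || current),
# which penalizes drift away from the base model's output distribution.
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary (last) axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def ce_loss(logits, targets):
    # Mean negative log-likelihood of the target token ids:
    # strengthens domain knowledge on the domain corpus.
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets].mean()

def kl_loss(base_logits, cur_logits):
    # Mean KL(base || current) per token position: keeps the
    # fine-tuned model close to the base model on general data.
    logp_base = log_softmax(base_logits)
    logp_cur = log_softmax(cur_logits)
    return (np.exp(logp_base) * (logp_base - logp_cur)).sum(axis=-1).mean()

def mol_loss(domain_logits, domain_targets, base_logits, cur_logits):
    # Unweighted sum over a 1:1 domain-to-general token mix,
    # mirroring the ratio the paper reports as optimal.
    return ce_loss(domain_logits, domain_targets) + kl_loss(base_logits, cur_logits)
```

In an actual training loop the two terms would be computed on separate mini-batches (domain vs. general corpus) and the base-model logits would come from a frozen copy of the pretrained checkpoint, evaluated without gradients.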