Dual Language Models: Balancing Training Efficiency and Overfitting Resilience

📅 2025-12-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of simultaneously achieving high training efficiency and strong generalization—particularly robustness against overfitting—in language model training. We propose a dual-objective cooperative training paradigm that integrates autoregressive modeling with masked diffusion modeling, requiring no architectural modifications to the base model. A multi-objective weighted loss mechanism enables systematic evaluation, revealing that the optimal objective weighting remains remarkably stable across varying data repetition rates and downstream tasks. Extensive ablation studies across 50 models demonstrate that the dual-objective models consistently outperform single-objective baselines across all evaluation settings, substantially improving generalization while preserving training efficiency. To our knowledge, this is the first work to systematically establish the robustness and transferability advantages of jointly optimizing autoregressive and diffusion objectives in language model pretraining.
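The core idea, a single weighted loss over the autoregressive and masked-diffusion objectives, can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: random logits stand in for model outputs, and the mixing weight `lam` and 50% masking ratio are placeholder values, not the optimal ratio the paper derives.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq_len = 10, 8
tokens = rng.integers(0, vocab, size=seq_len)

def cross_entropy(logits, targets):
    # Mean negative log-likelihood of targets under softmax(logits).
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Autoregressive objective: predict token t+1 from the prefix up to t.
ar_logits = rng.normal(size=(seq_len - 1, vocab))  # stand-in model outputs
ar_loss = cross_entropy(ar_logits, tokens[1:])

# Masked-diffusion objective: predict tokens at randomly masked positions.
mask = rng.random(seq_len) < 0.5  # illustrative masking ratio
if not mask.any():
    mask[0] = True  # ensure at least one masked position
md_logits = rng.normal(size=(mask.sum(), vocab))  # stand-in model outputs
md_loss = cross_entropy(md_logits, tokens[mask])

# Dual-objective weighted loss; lam is the mixing weight the paper
# sweeps over across 50 models (0.5 here is a placeholder).
lam = 0.5
total_loss = lam * ar_loss + (1.0 - lam) * md_loss
```

In an actual training loop, both losses would come from the same backbone on the same batch, which is what lets the method work with no architectural changes; only the weighting `lam` is tuned.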

📝 Abstract
This paper combines autoregressive and masked-diffusion training objectives without any architectural modifications, resulting in flexible language models that outperform single-objective models. Autoregressive modeling has been a popular approach, partly because of its training efficiency; however, this efficiency comes at the cost of heightened sensitivity to overfitting. Masked-diffusion models, by contrast, are less efficient to train but more resilient to overfitting. In this work, we demonstrate that dual-objective training achieves the best of both worlds. To derive the optimal ratio between the two objectives, we train and evaluate 50 language models under varying levels of data repetition. We show that combining both objectives is optimal under all evaluated settings and that the optimal ratio is similar whether targeting autoregressive or masked-diffusion downstream performance.
Problem

Research questions and friction points this paper is trying to address.

Balancing training efficiency and overfitting resilience in language models
Combining autoregressive and masked-diffusion objectives without architecture changes
Determining the optimal ratio between dual training objectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines autoregressive and masked-diffusion training objectives
Uses dual-objective training without architectural modifications
Determines optimal objective ratio via extensive model evaluation