AI Summary
To address the overfitting and degraded generalization of high-entropy tokens that arise when autoregressive language models are fine-tuned over multiple rounds on scarce domain data, this paper proposes a structured regularization method dynamically governed by token-level information entropy. The core innovation is a novel entropy-guided, curriculum-style token dropout mechanism; it is the first to explicitly model the dynamic imbalance in token learning difficulty as a principled basis for regularization design, thereby aligning training dynamics with token-level uncertainty. The method integrates entropy estimation, dynamic mask sampling, and curriculum scheduling. Evaluated on models ranging from 0.6B to 8B parameters, it significantly improves training stability and generalization across multiple fine-tuning rounds, consistently outperforming baselines including Dropout and Label Smoothing, and achieving an average accuracy gain of 3.2% on low-resource domain tasks.
Abstract
As access to high-quality, domain-specific data grows increasingly scarce, multi-epoch training has become a practical strategy for adapting large language models (LLMs). However, autoregressive models often suffer from performance degradation under repeated data exposure, where overfitting leads to a marked decline in model capability. Through empirical analysis, we trace this degradation to an imbalance in learning dynamics: predictable, low-entropy tokens are learned quickly and come to dominate optimization, while the model's ability to generalize on high-entropy tokens deteriorates with continued training. To address this, we introduce EntroDrop, an entropy-guided token dropout method that functions as structured data regularization. EntroDrop selectively masks low-entropy tokens during training and employs a curriculum schedule to adjust regularization strength in alignment with training progress. Experiments across model scales from 0.6B to 8B parameters show that EntroDrop consistently outperforms standard regularization baselines and maintains robust performance throughout extended multi-epoch training. These findings underscore the importance of aligning regularization with token-level learning dynamics when training on limited data. Our approach offers a promising pathway toward more effective adaptation of LLMs in data-constrained domains.
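The core mechanism described above can be sketched as follows: estimate each token's predictive entropy, identify low-entropy (already well-learned) tokens, and drop them from the loss with a probability that ramps up over epochs. This is a minimal illustration, not the authors' implementation; the function names, the median-entropy threshold, and the linear curriculum are all assumptions.

```python
# Minimal sketch of an EntroDrop-style loss mask (assumed details, not the
# paper's exact method): low-entropy tokens are stochastically excluded from
# the loss, with drop probability growing linearly over training epochs.
import numpy as np

def token_entropy(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy of each token's predictive distribution.
    probs: (seq_len, vocab_size), rows summing to 1."""
    eps = 1e-12  # guard against log(0)
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def entrodrop_mask(probs, epoch, total_epochs, p_max=0.3, quantile=0.5, rng=None):
    """Boolean loss mask: True = token contributes to the loss.
    Tokens below the entropy quantile are 'easy' and are dropped with
    probability p_max * (epoch + 1) / total_epochs (assumed curriculum)."""
    rng = rng or np.random.default_rng(0)
    ent = token_entropy(probs)
    threshold = np.quantile(ent, quantile)       # split easy vs. hard tokens
    p_drop = p_max * (epoch + 1) / total_epochs  # linear curriculum schedule
    low_entropy = ent < threshold
    drop = low_entropy & (rng.random(ent.shape) < p_drop)
    return ~drop

# Toy usage: six tokens over a four-word vocabulary.
probs = np.array([
    [0.97, 0.01, 0.01, 0.01],  # confident prediction -> low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform -> high entropy, always kept
    [0.90, 0.05, 0.03, 0.02],
    [0.40, 0.30, 0.20, 0.10],
    [0.95, 0.02, 0.02, 0.01],
    [0.30, 0.30, 0.20, 0.20],
])
# Final epoch with p_max=1.0: every low-entropy token is masked out.
mask = entrodrop_mask(probs, epoch=3, total_epochs=4, p_max=1.0)
```

Masking only the loss (rather than the input) keeps the forward pass intact while removing the gradient signal from tokens the model has already fit, which is one plausible way to rebalance optimization toward high-entropy tokens.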