Entropy-Guided Token Dropout: Training Autoregressive Language Models with Limited Domain Data

๐Ÿ“… 2025-12-29
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
When autoregressive language models are fine-tuned over multiple rounds on scarce domain data, they overfit and their generalization on high-entropy tokens degrades. To address this, the paper proposes a structured regularization method that is dynamically governed by token-level information entropy. The core innovation is an entropy-guided, curriculum-style token dropout mechanism, the first to explicitly model the dynamic imbalance in token learning difficulty as a principled basis for regularization design, thereby aligning training dynamics with token-level uncertainty. The method integrates three components: entropy estimation, dynamic mask sampling, and curriculum scheduling. Evaluated on models ranging from 0.6B to 8B parameters, it significantly improves training stability and generalization across multiple fine-tuning rounds, consistently outperforming baselines including Dropout and Label Smoothing, and achieves an average accuracy gain of 3.2% on low-resource domain tasks.
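
The three components named above are not accompanied by code on this page, so the following is only a rough illustration of the first one. A minimal PyTorch sketch of entropy estimation, computing each token's predictive entropy from the model's output logits (the function name and tensor shapes are our assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Predictive entropy of each next-token distribution, in nats.

    logits: (batch, seq, vocab) raw language-model outputs.
    Returns a (batch, seq) tensor: low values mark tokens the model
    already predicts confidently; high values mark uncertain tokens.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)
```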

๐Ÿ“ Abstract
As access to high-quality, domain-specific data grows increasingly scarce, multi-epoch training has become a practical strategy for adapting large language models (LLMs). However, autoregressive models often suffer from performance degradation under repeated data exposure, where overfitting leads to a marked decline in model capability. Through empirical analysis, we trace this degradation to an imbalance in learning dynamics: predictable, low-entropy tokens are learned quickly and come to dominate optimization, while the model's ability to generalize on high-entropy tokens deteriorates with continued training. To address this, we introduce EntroDrop, an entropy-guided token dropout method that functions as structured data regularization. EntroDrop selectively masks low-entropy tokens during training and employs a curriculum schedule to adjust regularization strength in alignment with training progress. Experiments across model scales from 0.6B to 8B parameters show that EntroDrop consistently outperforms standard regularization baselines and maintains robust performance throughout extended multi-epoch training. These findings underscore the importance of aligning regularization with token-level learning dynamics when training on limited data. Our approach offers a promising pathway toward more effective adaptation of LLMs in data-constrained domains.
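
As an illustration of how selectively masking low-entropy tokens out of the training loss might look, here is a hedged PyTorch sketch. It excludes roughly a `drop_rate` fraction of the lowest-entropy tokens from the cross-entropy objective; the batch-quantile threshold is a simple stand-in for the paper's dynamic mask sampling, and every name here is hypothetical:

```python
import torch
import torch.nn.functional as F

def entropy_guided_ce_loss(logits: torch.Tensor,
                           labels: torch.Tensor,
                           drop_rate: float) -> torch.Tensor:
    """Cross-entropy in which roughly a `drop_rate` fraction of the
    lowest-entropy (easiest) tokens is excluded from the loss, so
    they stop dominating optimization during repeated epochs.

    logits: (batch, seq, vocab); labels: (batch, seq).
    """
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)     # (B, T)

    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, reduction="none")    # (B, T)

    # Stand-in for the paper's "dynamic mask sampling": keep only
    # tokens whose entropy exceeds the batch's drop_rate quantile.
    threshold = torch.quantile(entropy.detach().flatten(), drop_rate)
    keep = entropy.detach() > threshold
    if not keep.any():             # guard: never mask every token
        return per_token.mean()
    return per_token[keep].mean()
```

During training, `drop_rate` would come from the curriculum schedule; a sketch of one plausible schedule follows the Innovation list below.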
Problem

Research questions and friction points this paper is trying to address.

Addresses performance degradation in autoregressive models during multi-epoch training
Mitigates overfitting by balancing learning between low- and high-entropy tokens
Enables effective language model adaptation with limited domain-specific data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Entropy-guided token dropout method for regularization
Selectively masks low-entropy tokens during training
Curriculum schedule adjusts regularization strength with progress (a sketch follows this list)
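
The page does not specify the shape of the curriculum schedule. One plausible sketch, assuming a warmup phase followed by a linear ramp (the constants are illustrative, not taken from the paper):

```python
def curriculum_drop_rate(step: int, total_steps: int,
                         max_rate: float = 0.3,
                         warmup_frac: float = 0.1) -> float:
    """Hypothetical schedule: no token dropout during warmup, then a
    linear ramp to max_rate, so regularization strengthens as easy
    (low-entropy) tokens are mastered. Shape and constants are
    assumptions; the paper only says the schedule tracks progress.
    """
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return 0.0
    progress = (step - warmup) / max(1, total_steps - warmup)
    return max_rate * min(1.0, progress)
```

The rate produced here would be fed into a loss like the one sketched under the abstract.
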
๐Ÿ”Ž Similar Papers
No similar papers found.
Jiapeng Wang
South China University of Technology
document understanding, visual information extraction, multi-modal learning, CLIP, LLM
Yiwen Hu
University of Maryland, Baltimore County
Wireless network, Network security, Mobile systems and applications
Yanzipeng Gao
Gaoling School of Artificial Intelligence, Renmin University of China
Haoyu Wang
Gaoling School of Artificial Intelligence, Renmin University of China
Shuo Wang
Tsinghua University
Hongyu Lu
WeChat, Tencent
Jiaxin Mao
Renmin University of China
Information Retrieval, User Behavior Analysis, Data Mining and Machine Learning
Wayne Xin Zhao
Professor, Renmin University of China
Recommender System, Natural Language Processing, Large Language Model
Junyi Li
Department of Data Science, City University of Hong Kong
Xiao Zhang
Gaoling School of Artificial Intelligence, Renmin University of China