🤖 AI Summary
This work identifies a corrupted-semantics problem in masked language modeling (MLM) induced by the [MASK] token: replacing input tokens with [MASK] leaves a corrupted context that can convey multiple, ambiguous meanings and increases representation multimodality, degrading downstream task performance. To address this, the authors propose ExLM, an enhanced-context MLM whose core idea is to expand each [MASK] token into multiple states in the input context and model the dependencies between these expanded states, increasing context capacity and strengthening contextual awareness. Experiments demonstrate that ExLM significantly outperforms strong baselines, including BERT and RoBERTa, on both text understanding and molecular SMILES modeling tasks. It effectively mitigates semantic ambiguity, reduces representation uncertainty, and improves accuracy and robustness across diverse downstream applications.
📝 Abstract
Masked Language Models (MLMs) have achieved remarkable success in many self-supervised representation learning tasks. MLMs are trained by randomly replacing some tokens in the input sentences with $\texttt{[MASK]}$ tokens and predicting the original tokens based on the remaining context. This paper explores the impact of $\texttt{[MASK]}$ tokens on MLMs. Analytical studies show that masking tokens can introduce the corrupted semantics problem, wherein the corrupted context may convey multiple, ambiguous meanings. This problem is also a key factor affecting the performance of MLMs on downstream tasks. Based on these findings, we propose a novel enhanced-context MLM, ExLM. Our approach expands $\texttt{[MASK]}$ tokens in the input context and models the dependencies between these expanded states. This expansion increases context capacity and enables the model to capture richer semantic information, effectively mitigating the corrupted semantics problem during pre-training. Experimental results demonstrate that ExLM achieves significant performance improvements in both text modeling and SMILES modeling tasks. Further analysis confirms that ExLM enhances semantic representations through context enhancement, and effectively reduces the multimodality problem commonly observed in MLMs.
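To make the two ideas in the abstract concrete, here is a minimal, hedged sketch: standard MLM corruption (randomly replacing tokens with `[MASK]`) followed by a toy version of the expansion step, where each `[MASK]` is replaced by several placeholder states. The function names, the `[MASK_i]` placeholder naming, and the expansion factor `k` are illustrative assumptions, not the paper's actual implementation.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    # Standard MLM corruption: replace a random subset of tokens with [MASK].
    # Returns the corrupted sequence and, per position, the original token
    # to predict (None where the token was left untouched).
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets.append(tok)
        else:
            corrupted.append(tok)
            targets.append(None)
    return corrupted, targets

def expand_masks(tokens, k=3):
    # Hypothetical sketch of ExLM-style context expansion: each [MASK] is
    # expanded into k placeholder states, increasing context capacity so the
    # model can represent multiple candidate meanings at a masked position.
    expanded = []
    for tok in tokens:
        if tok == MASK:
            expanded.extend(f"[MASK_{i}]" for i in range(k))
        else:
            expanded.append(tok)
    return expanded

corrupted, targets = mask_tokens("the cat sat on the mat".split(), mask_prob=0.4)
expanded = expand_masks(corrupted, k=2)
```

In the actual model the expanded states would be learned embeddings whose pairwise dependencies are modeled by the transformer, rather than literal string placeholders; the sketch only shows where the expansion sits in the pipeline.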