ExLM: Rethinking the Impact of $\texttt{[MASK]}$ Tokens in Masked Language Models

📅 2025-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies the corrupted-semantics problem in masked language modeling (MLM): replacing input tokens with [MASK] can leave a context that admits multiple, ambiguous meanings, and this ambiguity degrades downstream task performance. To address it, the authors propose ExLM, an enhanced-context MLM that expands each [MASK] token into multiple states in the input context and models the dependencies among these expanded states, increasing context capacity and strengthening contextual awareness. Experiments show that ExLM significantly outperforms strong baselines, including BERT and RoBERTa, on both text understanding and molecular SMILES modeling tasks. It effectively mitigates semantic ambiguity, reduces representation multimodality, and improves accuracy and robustness across diverse downstream applications.

📝 Abstract
Masked Language Models (MLMs) have achieved remarkable success in many self-supervised representation learning tasks. MLMs are trained by randomly replacing some tokens in the input sentences with $\texttt{[MASK]}$ tokens and predicting the original tokens based on the remaining context. This paper explores the impact of $\texttt{[MASK]}$ tokens on MLMs. Analytical studies show that masking tokens can introduce the corrupted semantics problem, wherein the corrupted context may convey multiple, ambiguous meanings. This problem is also a key factor affecting the performance of MLMs on downstream tasks. Based on these findings, we propose a novel enhanced-context MLM, ExLM. Our approach expands $\texttt{[MASK]}$ tokens in the input context and models the dependencies between these expanded states. This expansion increases context capacity and enables the model to capture richer semantic information, effectively mitigating the corrupted semantics problem during pre-training. Experimental results demonstrate that ExLM achieves significant performance improvements in both text modeling and SMILES modeling tasks. Further analysis confirms that ExLM enhances semantic representations through context enhancement, and effectively reduces the multimodality problem commonly observed in MLMs.
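The masking procedure the abstract describes, and the expansion idea behind ExLM, can be sketched roughly as follows. This is a simplified illustration, not the paper's implementation: the expansion factor `k` and the `[MASK_j]` state names are assumptions made for the sake of the example.

```python
import random

MASK = "[MASK]"

def mlm_corrupt(tokens, mask_prob=0.15, seed=0):
    """Standard MLM corruption: replace a random subset of tokens with [MASK].

    Returns the corrupted sequence and a dict mapping each masked position
    to the original token the model must predict from the remaining context.
    """
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted[i] = MASK
            targets[i] = tok
    return corrupted, targets

def exlm_expand(corrupted, k=3):
    """Illustrative ExLM-style expansion (simplified): each [MASK] is
    replaced by k expanded mask states, enlarging the context capacity
    available to represent the ambiguous masked position."""
    expanded = []
    for tok in corrupted:
        if tok == MASK:
            expanded.extend(f"[MASK_{j}]" for j in range(k))
        else:
            expanded.append(tok)
    return expanded
```

A single [MASK] forces one slot to stand in for a possibly multimodal set of meanings; giving it `k` expanded states, with dependencies modeled among them, is the intuition the paper formalizes.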
Problem

Research questions and friction points this paper is trying to address.

Masked Language Models
Polysemy
Semantic Ambiguity
Innovation

Methods, ideas, or system contributions that make the work stand out.

ExLM Model
Mask Token Enhancement
Contextual Information Utilization
Kangjie Zheng
Wellcome Sanger Institute
AI4Science, NLP, Large Language Model
Junwei Yang
Peking University
Natural Language Processing, Graph Neural Network, AI4Science
Siyue Liang
School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University-Anker Embodied AI Lab, Peking University
Bin Feng
International Digital Economy Academy (IDEA), Shenzhen, China
Zequn Liu
Microsoft Research AI4Science, Asia
Wei Ju
College of Computer Science, Sichuan University, Chengdu, China
Zhiping Xiao
Postdoc at University of Washington
CSE, DM, ML
Ming Zhang
School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University-Anker Embodied AI Lab, Peking University