Entropy-aware Masking for Masked Language Modeling

📅 2026-05-27

🤖 AI Summary

This work addresses a weakness of the random masking conventionally used in masked language model pretraining: tokens are chosen without regard to how much the model can learn from them. The authors propose an entropy-based masking strategy that selects tokens for which the model's predictive entropy is high, treating those tokens as more informative and uncertain. A self-masking variant obtains this signal without an external reference model. On the GLUE benchmark, the method improves the average score by 5% over the baseline, and combining entropy masking with knowledge distillation gives the best overall results reported.

📝 Abstract

Masked language modeling has become a standard pretraining objective for encoder-based language models. In this approach, certain tokens in the input are masked, and the model learns to predict them using the surrounding context. This process enables the model to capture both syntactic and semantic properties of language. Conventionally, the tokens selected for masking are chosen at random, which may not always yield the most effective learning signals. In this work, we examine a token masking strategy based on the entropy distribution. We use the model's entropy over token predictions to identify which tokens should be masked. This method aims to target tokens that are more informative and uncertain in order to improve training efficacy. We also propose a novel self-masking approach that enhances training efficiency without relying on an external reference model. Experimental results demonstrate that our method achieves an average performance improvement of 5% in GLUE scores compared to the baseline. Further, we experiment with combining knowledge distillation with entropy masking, resulting in the best overall results.
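
The abstract does not spell out the selection rule, so the following is only a minimal sketch of one way entropy-based masking could work, not the authors' implementation. It assumes per-position vocabulary logits are already available (from the model being trained in the self-masking case, or from a reference model otherwise) and that the highest-entropy positions are chosen deterministically. The function names `token_entropy` and `select_mask_positions`, the top-k rule, and the 15% default ratio are illustrative assumptions.

```python
import numpy as np


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution at each position.

    logits: array of shape (seq_len, vocab_size).
    Returns an array of shape (seq_len,).
    """
    # Subtract the row maximum for a numerically stable log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    return -(probs * log_probs).sum(axis=-1)


def select_mask_positions(logits, mask_ratio=0.15):
    """Return sorted indices of the positions with the highest predictive entropy.

    Assumption: a deterministic top-k rule. The paper may instead sample
    positions or mix entropy-based and random selection.
    """
    entropy = token_entropy(logits)
    num_to_mask = max(1, int(round(mask_ratio * len(entropy))))
    # Stable sort on negated entropy: highest entropy first, ties by position.
    order = np.argsort(-entropy, kind="stable")
    return np.sort(order[:num_to_mask])


# Toy example: 4 positions over a 3-word vocabulary.
toy_logits = np.array([
    [0.0, 0.0, 0.0],   # uniform prediction, maximal uncertainty
    [10.0, 0.0, 0.0],  # confident prediction
    [0.1, 0.0, 0.0],   # nearly uniform
    [0.0, 10.0, 0.0],  # confident prediction
])
positions = select_mask_positions(toy_logits, mask_ratio=0.5)
```

In this toy input the two uncertain positions are selected and the two confident ones are left unmasked. Random masking, by contrast, would pick positions independently of the logits.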

Problem

Research questions and friction points this paper is trying to address.

masked language modeling

token masking

entropy

pretraining

learning signal

Innovation

Methods, ideas, or system contributions that make the work stand out.

entropy-aware masking

masked language modeling

self-masking