Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More

📅 2025-02-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) exhibit limitations in critical information retrieval and long-range reasoning. To address this, we propose MEAP—a novel training paradigm that seamlessly integrates masked language modeling (MLM) with autoregressive next-token prediction (NTP) within a pure decoder architecture. Unlike conventional approaches, MEAP randomly masks a small subset of input tokens but retains standard autoregressive decoding—requiring no bidirectional attention, encoder-decoder structure, or additional computational overhead. This mechanism enhances the discriminability of attention scores, guiding the model to prioritize task-relevant signals. Experiments demonstrate that MEAP significantly improves critical information retrieval and long-context reasoning: under supervised fine-tuning, it achieves a +11.77 percentage point accuracy gain in lost-in-the-middle scenarios. Crucially, MEAP preserves—or even surpasses—baseline performance on commonsense reasoning benchmarks, confirming its efficacy without compromising general reasoning capabilities.

📝 Abstract
Large Language Models (LLMs) have been found to struggle with accurately retrieving key information. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs standard autoregressive next-token prediction using a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. Intensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par or better on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP's effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model's focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models.
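The core recipe described in the abstract—mask a small fraction of input tokens, then train with ordinary next-token prediction on the original sequence—can be sketched as a data-preparation step. This is an illustrative sketch, not the authors' code; the function name `meap_inputs`, the mask ratio, and the use of a single mask-token id are assumptions for the example.

```python
import random

def meap_inputs(token_ids, mask_id, mask_ratio=0.15, seed=0):
    """Build one MEAP training pair: corrupt the inputs with a few
    mask tokens, but keep the standard NTP targets unchanged."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    n_mask = max(1, int(len(inputs) * mask_ratio))
    # Randomly replace a small fraction of input tokens with the mask id.
    for i in rng.sample(range(len(inputs)), n_mask):
        inputs[i] = mask_id
    # Decoder-only NTP: positions 0..n-2 predict positions 1..n-1.
    # Targets come from the ORIGINAL (unmasked) sequence, so no
    # bidirectional attention or encoder-decoder machinery is needed.
    x = inputs[:-1]
    y = list(token_ids)[1:]
    return x, y

x, y = meap_inputs(list(range(10)), mask_id=99, mask_ratio=0.2)
```

The model and loss are untouched: a plain causal Transformer trained with cross-entropy on `(x, y)` pairs, which is why the paper reports no extra pre-training or inference cost.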
Problem

Research questions and friction points this paper is trying to address.

LLMs struggle to accurately retrieve key information from their context
Combining MLM with NTP traditionally requires bidirectional attention or encoder-decoder architectures
Attention is spread across peripheral context instead of task-relevant signals
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates MLM into NTP by randomly masking a small fraction of input tokens
Retains a standard decoder-only Transformer with no additional compute
Promotes more distinguishable attention scores, improving key information retrieval
Xialie Zhuang
Ubiquant
Zhikai Jia
SCITIX (SGP) TECH PTE. LTD., Singapore
Jianjin Li
South China Normal University, China
Zhenyu Zhang
University of Texas at Austin, USA
Li Shen
Sun Yat-Sen University, China
Zheng Cao
SCITIX (SGP) TECH PTE. LTD., Singapore
Shiwei Liu
University of Oxford, UK