🤖 AI Summary
Large language models (LLMs) exhibit limitations in accurately retrieving key information and in long-range reasoning. To address this, we propose MEAP, a training paradigm that seamlessly integrates masked language modeling (MLM) into autoregressive next-token prediction (NTP) within a decoder-only architecture. Unlike conventional approaches, MEAP randomly masks a small fraction of input tokens while retaining standard autoregressive decoding, requiring no bidirectional attention, no encoder-decoder structure, and no additional computational overhead. This mechanism makes attention scores more distinguishable, guiding the model to prioritize task-relevant signals. Experiments demonstrate that MEAP significantly improves key-information retrieval and long-context reasoning: under supervised fine-tuning in lost-in-the-middle scenarios, it outperforms NTP by 11.77 percentage points. Crucially, MEAP matches or even surpasses baseline performance on commonsense reasoning benchmarks, confirming its efficacy without compromising general reasoning capabilities.
📝 Abstract
Large Language Models (LLMs) often struggle to accurately retrieve key information from their context. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs standard autoregressive next-token prediction using a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. Extensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par with or better than it on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP's effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model's focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models.
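To make the mechanism concrete, here is a minimal sketch of how MEAP-style training inputs could be constructed: a small fraction of input tokens is replaced by a dedicated mask token, while the next-token prediction targets remain the original, unmasked sequence. The `MASK_ID` value, the `meap_inputs` helper, and the default mask ratio are illustrative assumptions, not the paper's actual implementation.

```python
import random

MASK_ID = 0  # hypothetical id reserved for a dedicated [MASK] token


def meap_inputs(token_ids, mask_ratio=0.15, seed=None):
    """Sketch of MEAP input construction (illustrative, not the paper's code).

    Randomly replaces a fraction of input tokens with [MASK]; the model
    still decodes left-to-right, and the NTP targets are the ORIGINAL
    (unmasked) tokens, so no bidirectional attention is needed.
    """
    rng = random.Random(seed)
    masked = [MASK_ID if rng.random() < mask_ratio else t for t in token_ids]
    inputs = masked[:-1]      # model sees the (partially masked) prefix
    targets = token_ids[1:]   # standard shifted next-token targets
    return inputs, targets
```

Because only the inputs are perturbed and the loss remains ordinary next-token cross-entropy over the shifted targets, this scheme adds no extra compute relative to plain NTP training.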