🤖 AI Summary
This work challenges the prevailing assumption that decoder-only Transformers are optimal for next-token prediction under unconstrained computational resources, proposing an Encoder-only Next Token Prediction (ENTP) paradigm. Methodologically, it removes the causal mask from a standard Transformer to enable full self-attention over the prefix and systematically evaluates ENTP via theoretical analysis and controlled synthetic tasks (e.g., Count3), as well as downstream benchmarks, including addition, in-context learning, and language modeling. Key contributions include: (i) a demonstration that a pure encoder architecture solves the Count3 task, a problem on which decoder-only models provably fail to generalize; (ii) joint theoretical and empirical evidence establishing ENTP's advantages in expressive power and computational complexity over decoder-only counterparts; and (iii) consistent performance gains over same-scale decoder-only baselines across multiple tasks, thereby questioning the field's entrenched reliance on decoder-only architectures for next-token prediction.
📝 Abstract
Next-token prediction is conventionally done using decoder-only Transformers with causal attention, as this approach allows for efficient reuse of keys and values. If we were not compute-limited, should we still use decoder-only Transformers? In this work, we introduce Encoder-only Next Token Prediction (ENTP). We explore the differences between ENTP and decoder-only Transformers in expressive power and complexity, highlighting potential advantages of ENTP in settings with unbounded compute. We introduce the $\operatorname{Count3}$ task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate the superior performance of ENTP across representative tasks where next-token-prediction-based Transformers can be evaluated, including addition, in-context learning, and language modeling.
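To make the compute trade-off concrete, here is a minimal sketch (not the paper's implementation) of why causal attention permits key/value reuse while ENTP does not. With a causal mask, the representation of position $i$ depends only on tokens $0..i$, so prefix activations can be cached during generation; without the mask, every position attends to the whole sequence, so the encoder must be re-run on each growing prefix. The single-head attention with identity projections and the toy dimensions below are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, causal):
    """Toy single-head self-attention (identity Q/K/V projections)."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    if causal:
        # Decoder-only: position i may only attend to positions <= i.
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    return softmax(scores, axis=-1) @ X

rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 8))  # 5 token embeddings, dim 8 (toy sizes)

# Causal attention: the representation of an early position is unchanged
# when later tokens arrive, which is what makes KV caching sound.
h_causal_3 = attention(seq[:3], causal=True)
h_causal_5 = attention(seq[:5], causal=True)
print(np.allclose(h_causal_3[0], h_causal_5[0]))   # True: prefix reusable

# ENTP-style full attention: early positions see later tokens, so their
# representations change as the prefix grows; nothing can be cached, and
# each generation step recomputes attention over the entire prefix.
h_full_3 = attention(seq[:3], causal=False)
h_full_5 = attention(seq[:5], causal=False)
print(np.allclose(h_full_3[0], h_full_5[0]))       # False: must recompute
```

This recomputation is the source of ENTP's higher inference cost, and also of its extra expressive power: each prediction is made by a fully bidirectional pass over the current prefix.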