🤖 AI Summary
This work challenges the prevailing assumption that decoder-only Transformers are optimal for next-token prediction under unconstrained computational resources, proposing an Encoder-only Next Token Prediction (ENTP) paradigm. Methodologically, it removes the causal mask from a standard Transformer to enable full self-attention over the prefix and systematically evaluates ENTP via theoretical analysis and controlled synthetic tasks (e.g., Count3), as well as downstream benchmarks, including addition, in-context learning, and language modeling. Key contributions include: (i) a demonstration that a pure encoder architecture solves the Count3 task, a problem on which decoder-only models provably fail to generalize; (ii) joint theoretical and empirical evidence establishing ENTP's advantages in expressive power and computational complexity over decoder-only counterparts; and (iii) consistent performance gains over same-scale decoder-only baselines across multiple tasks, thereby questioning the field's entrenched reliance on decoder-only architectures for next-token prediction.
📝 Abstract
Next-token prediction is conventionally done using decoder-only Transformers with causal attention, as this approach allows for efficient reuse of keys and values. If we were not compute-limited, should we still use decoder-only Transformers? In this work, we introduce Encoder-only Next Token Prediction (ENTP). We explore the differences between ENTP and decoder-only Transformers in expressive power and complexity, highlighting potential advantages of ENTP in settings with unbounded compute. We introduce the $\operatorname{Count3}$ task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate the superior performance of ENTP across representative tasks where next-token-prediction-based Transformers can be evaluated, including addition, in-context learning, and language modeling.
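To make the compute trade-off concrete, here is a minimal sketch (not the paper's implementation) of why causal attention permits key/value reuse while ENTP does not. With a causal mask, the representation of position $i$ depends only on tokens $0..i$, so prefix activations can be cached during generation; without the mask, every position attends to the whole sequence, so the encoder must be re-run on each growing prefix. The single-head attention with identity projections and the toy dimensions below are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, causal):
    """Toy single-head self-attention (identity Q/K/V projections)."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    if causal:
        # Decoder-only: position i may only attend to positions <= i.
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    return softmax(scores, axis=-1) @ X

rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 8))  # 5 token embeddings, dim 8 (toy sizes)

# Causal attention: the representation of an early position is unchanged
# when later tokens arrive, which is what makes KV caching sound.
h_causal_3 = attention(seq[:3], causal=True)
h_causal_5 = attention(seq[:5], causal=True)
print(np.allclose(h_causal_3[0], h_causal_5[0]))   # True: prefix reusable

# ENTP-style full attention: early positions see later tokens, so their
# representations change as the prefix grows; nothing can be cached, and
# each generation step recomputes attention over the entire prefix.
h_full_3 = attention(seq[:3], causal=False)
h_full_5 = attention(seq[:5], causal=False)
print(np.allclose(h_full_3[0], h_full_5[0]))       # False: must recompute
```

This recomputation is the source of ENTP's higher inference cost, and also of its extra expressive power: each prediction is made by a fully bidirectional pass over the current prefix.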