ENTP: Encoder-only Next Token Prediction

📅 2024-10-02
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work challenges the prevailing assumption that decoder-only Transformers are optimal for next-token prediction (NTP) under unconstrained computational resources, proposing the Encoder-only Next Token Prediction (ENTP) paradigm. Methodologically, it replaces the decoder's causal attention with an encoder's full self-attention, re-encoding the prefix at every prediction step, and evaluates ENTP through theoretical analysis, controlled synthetic tasks (e.g., Count3), and downstream benchmarks including arithmetic (addition), in-context learning, and language modeling. Key contributions include: (i) a demonstration that a pure encoder architecture solves the Count3 task, a problem on which decoder-only models provably fail to generalize; (ii) joint theoretical and empirical evidence of ENTP's advantages in expressive power and computational complexity over decoder-only counterparts; and (iii) consistent performance gains over same-scale decoder-only baselines across multiple tasks, questioning the field's default reliance on decoder-only architectures for NTP.

📝 Abstract
Next-token prediction is conventionally done using decoder-only Transformers with causal attention, as this approach allows for efficient reuse of keys and values. If we were not compute-limited, should we still use decoder-only Transformers? In this work, we introduce Encoder-only Next Token Prediction (ENTP). We explore the differences between ENTP and decoder-only Transformers in expressive power and complexity, highlighting potential advantages of ENTP in settings with unbounded compute. We introduce the $\operatorname{Count3}$ task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate the superior performance of ENTP across representative tasks where next-token prediction based Transformers can be evaluated, including addition, in-context learning, and language modeling.
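The architectural difference the abstract describes can be illustrated with a minimal PyTorch sketch (the model sizes and toy vocabulary here are illustrative assumptions, not the paper's setup). A causal mask lets a decoder-only model predict every next token in one pass and cache keys/values; ENTP drops the mask, so predicting the token after position t requires a fresh full-attention pass over the prefix, one pass per position:

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Illustrative encoder stack; the `causal` flag toggles the mask."""

    def __init__(self, vocab=16, d=32, heads=4, layers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        layer = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(d, vocab)

    def forward(self, x, causal: bool):
        mask = None
        if causal:
            # Upper-triangular -inf mask: position i attends only to <= i.
            mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.encoder(self.emb(x), mask=mask))

model = TinyTransformer().eval()
x = torch.randint(0, 16, (1, 8))  # one 8-token sequence

with torch.no_grad():
    # Decoder-style NTP: a single causal pass yields next-token logits
    # at every position simultaneously.
    dec_logits = model(x, causal=True)  # shape (1, 8, 16)

    # ENTP-style NTP: re-encode each prefix with full bidirectional
    # attention; only the final position of each pass is a prediction.
    entp_logits = torch.stack(
        [model(x[:, : t + 1], causal=False)[0, -1] for t in range(x.size(1))]
    )  # shape (8, 16)
```

The loop makes the cost trade-off concrete: ENTP cannot reuse keys and values across steps, which is exactly why the paper frames its advantages as relevant in unbounded-compute settings.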
Problem

Research questions and friction points this paper is trying to address.

Is next-token prediction with encoder-only Transformers viable?
How do ENTP and decoder-only Transformers compare in expressive power and complexity?
Which architecture is preferable when compute is unbounded?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Encoder-only Next Token Prediction (ENTP) paradigm
Expressivity and complexity analysis for unbounded-compute settings
Superior empirical performance on addition, in-context learning, and language modeling
Ethan Ewer
Department of Electrical and Computer Engineering, University of Wisconsin-Madison
Daewon Chae
University of Michigan
Generative Model · Deep Learning
Thomas Zeng
Department of Computer Science, University of Wisconsin-Madison
Jinkyu Kim
Department of Computer Science, Korea University
Kangwook Lee
University of Wisconsin-Madison, KRAFTON AI
Machine Learning · Information Theory