Revisiting the Shape Convention of Transformer Language Models

📅 2026-02-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work challenges the long-standing “narrow-wide-narrow” shape of the feed-forward network (FFN) module in Transformers and proposes a “wide-narrow-wide”, hourglass-shaped FFN built from residually connected sub-MLPs. Because the hourglass FFN is lighter than a conventional FFN, parameters can be reallocated under a fixed budget, for example by shrinking the FFN while enlarging the attention dimensions. Experiments show that the hourglass FFN outperforms the standard FFN at scales up to 400 million parameters and matches it at 1 billion parameters, and that shifting parameters from the FFN to attention yields consistent improvements at matched budgets.
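To make the described architecture concrete, here is a minimal PyTorch sketch of the idea, not the authors' implementation: the class names (HourglassSubMLP, HourglassFFN), the GELU activation, and the bottleneck width and block count are illustrative assumptions.

```python
import torch
import torch.nn as nn


class HourglassSubMLP(nn.Module):
    # One wide-narrow-wide sub-MLP: project from d_model down to a narrow
    # bottleneck, apply a nonlinearity, then project back up to d_model.
    def __init__(self, d_model: int, d_narrow: int):
        super().__init__()
        self.down = nn.Linear(d_model, d_narrow)
        self.act = nn.GELU()
        self.up = nn.Linear(d_narrow, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(x)))


class HourglassFFN(nn.Module):
    # A deeper but lighter FFN: a stack of hourglass sub-MLPs, each wrapped in a
    # residual connection, in place of the usual d_model -> 2..4*d_model -> d_model MLP.
    def __init__(self, d_model: int, d_narrow: int, num_blocks: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            [HourglassSubMLP(d_model, d_narrow) for _ in range(num_blocks)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = x + block(x)  # residual pathway around each sub-MLP
        return x


# Usage: swap this module in for the FFN of a Transformer block.
ffn = HourglassFFN(d_model=1024, d_narrow=256, num_blocks=4)
y = ffn(torch.randn(2, 16, 1024))  # (batch, sequence, d_model)
```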

📝 Abstract
Dense Transformer language models have largely adhered to one consistent architectural shape: each layer consists of an attention module followed by a feed-forward network (FFN) with a narrow-wide-narrow MLP, allocating most parameters to the MLP at expansion ratios between 2 and 4. Motivated by recent results that residual wide-narrow-wide (hourglass) MLPs offer superior function approximation capabilities, we revisit the long-standing MLP shape convention in Transformers, challenging the necessity of the narrow-wide-narrow design. To study this, we develop a Transformer variant that replaces the conventional FFN with a deeper hourglass-shaped FFN, comprising a stack of hourglass sub-MLPs connected by residual pathways. We posit that a deeper but lighter hourglass FFN can serve as a competitive alternative to the conventional FFN, and that parameters saved by using a lighter hourglass FFN can be more effectively utilized, such as by enlarging model hidden dimensions under fixed budgets. We confirm these hypotheses through empirical validation across model scales: hourglass FFNs outperform conventional FFNs up to 400M parameters and achieve comparable performance at larger scales up to 1B parameters; hourglass FFN variants with reduced FFN and increased attention parameters show consistent improvements over conventional configurations at matched budgets. Together, these findings shed new light on recent work and prompt a rethinking of the narrow-wide-narrow MLP convention and of the balance between attention and FFN towards efficient and expressive modern language models.
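As a rough illustration of the reallocation argument in the abstract (these numbers are assumptions, not results from the paper), one can compare per-layer FFN weight counts for a conventional FFN at expansion ratio 4 against a stack of narrow hourglass sub-MLPs; biases are ignored and all widths are illustrative.

```python
# Back-of-envelope FFN parameter counts per layer (weights only, biases ignored).
d = 1024                                         # assumed model hidden dimension

# Conventional narrow-wide-narrow FFN at expansion ratio 4: d -> 4d -> d
conventional = d * (4 * d) + (4 * d) * d         # ~8.4M parameters

# Hourglass FFN: 4 residual sub-MLPs, each d -> d/4 -> d
hourglass = 4 * (d * (d // 4) + (d // 4) * d)    # ~2.1M parameters

# The difference is the budget that can be spent on larger attention
# dimensions (or a larger hidden size) at a matched total parameter count.
print(conventional, hourglass, conventional - hourglass)
```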
Problem

Research questions and friction points this paper is trying to address.

Transformer
Feed-Forward Network
MLP shape
model architecture
parameter allocation
Innovation

Methods, ideas, or system contributions that make the work stand out.

hourglass MLP
Transformer architecture
feed-forward network
parameter efficiency
model scaling
Feng-Ting Liao
MediaTek Research
Meng-Hsi Chen
MediaTek Research
Guan-Ting Yi
MediaTek Research, National Taiwan University
Da-shan Shiu
MediaTek