Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor

📅 2026-05-20

🤖 AI Summary

This study asks whether recent architectural modifications to Transformers genuinely improve downstream performance at the 1–3B parameter scale. It evaluates 20 post-2021 modifications at 1.2B and 3B under strict iso-data, iso-compute, and iso-recipe control, using a multi-seed baseline noise floor, Bonferroni correction, and the CLIMB-12 downstream benchmark suite as the primary metric. Only two modifications remain significant after Bonferroni correction at 1.2B, and one of those two fails to train stably at 3B under the shared recipe. Two significant failures among attention-output modifications converge to within 2–3% of baseline validation loss yet drop 6–16 CLIMB points, which points to a substantial gap between pretraining loss and downstream performance and to the importance of cross-scale stability testing.

📝 Abstract

Narang et al. (2021) evaluated 40+ Transformer modifications at T5-base scale and concluded that most did not transfer. Five years later, the typical working regime has moved to 1-3B parameters, downstream evaluation has replaced pretraining perplexity, and a substantially different catalogue of modifications has emerged. We revisit their question by testing 20 post-2021 Transformer modifications at 1.2B and 3B under strict iso-data, iso-compute, iso-recipe control, with a multi-seed baseline noise floor and CLIMB-12 downstream evaluation as the primary metric. The central finding reproduces theirs on this curated set: most modifications do not transfer. Of the 20 modifications, only two remain significant after Bonferroni correction at 1.2B; one of those two further fails to train stably at 3B under the shared recipe. We also find that the loss-downstream gap reported by Tay et al. (2023) widens several-fold for attention-output modifications: two significant failures converge to within 2-3% of baseline validation loss yet drop 6-16 CLIMB points. We conclude that noise-floor reporting, downstream evaluation, and cross-scale stability testing are now prerequisites for architecture comparisons at 1-3B.
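
As a rough illustration of how a multi-seed noise floor and Bonferroni correction can be combined, the sketch below estimates the baseline spread across seeds and flags a modification only if its p-value clears the corrected threshold. It is not the paper's procedure: the function names, the significance level of 0.05, and the normal approximation are assumptions made here for illustration. With only a handful of seeds, a t-based test would usually be the more appropriate choice.

```python
from statistics import NormalDist, mean, stdev


def noise_floor(baseline_scores):
    """Return (mean, sample standard deviation) of a baseline metric across seeds."""
    return mean(baseline_scores), stdev(baseline_scores)


def two_sided_p(score, baseline_scores):
    """Two-sided p-value for one score against the baseline seed spread.

    Assumption: uses a normal approximation, which is optimistic for few seeds.
    """
    mu, sigma = noise_floor(baseline_scores)
    z = (score - mu) / sigma
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value that falls below the Bonferroni threshold alpha / m."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]
```

With 20 modifications and an assumed significance level of 0.05, each individual comparison has to reach p < 0.0025, which is why a gain that looks clear in isolation can fail to survive the correction.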

Problem

Research questions and friction points this paper is trying to address.

Transformer modifications

transfer learning

downstream evaluation

model scale

architecture comparison

Innovation

Methods, ideas, or system contributions that make the work stand out.

downstream evaluation

noise floor

iso-recipe control