Layerwise Importance Analysis of Feed-Forward Networks in Transformer-based Language Models

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the inter-layer importance distribution of feed-forward networks (FFNs) in Transformer language models during pretraining. We propose a layer-wise importance analysis framework that, under a fixed total parameter budget, reallocates FFN hidden dimensions across layers, enlarging FFNs in some layers while removing them entirely from others. Extensive ablation experiments are conducted via from-scratch pretraining on models spanning 285M to 1.2B parameters and 12–40 layers. Results demonstrate that FFN importance is non-uniformly distributed: concentrating FFN capacity within a contiguous middle 70% of layers consistently yields superior multi-task downstream performance compared to conventional uniform allocation. This finding challenges the standard practice of assigning equal FFN capacity to every layer and provides empirical grounding for designing more parameter-efficient Transformer architectures.

📝 Abstract
This study investigates the layerwise importance of feed-forward networks (FFNs) in Transformer-based language models during pretraining. We introduce an experimental approach that, while maintaining the total parameter count, increases the FFN dimensions in some layers and completely removes the FFNs from other layers. Furthermore, since our focus is on the importance of FFNs during pretraining, we train models from scratch to examine whether the importance of FFNs varies depending on their layer positions, rather than using publicly available pretrained models as is frequently done. Through comprehensive evaluations of models with varying sizes (285M, 570M, and 1.2B parameters) and layer counts (12, 24, and 40 layers), we demonstrate that concentrating FFNs in 70% of the consecutive middle layers consistently outperforms standard configurations for multiple downstream tasks.
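The budget-preserving reallocation described above can be sketched in a few lines. The snippet below is an illustrative sketch, not the authors' code: the function name `ffn_dims_for` and its parameters are hypothetical. It keeps the total FFN parameter budget of a uniform configuration fixed while concentrating all FFN hidden units into a contiguous middle span covering 70% of the layers, with the remaining layers receiving no FFN at all.

```python
# Hypothetical sketch of the paper's budget-preserving FFN reallocation.
# `ffn_dims_for` and its arguments are assumptions for illustration only.

def ffn_dims_for(num_layers: int, uniform_hidden: int, keep_frac: float = 0.7):
    """Per-layer FFN hidden sizes that preserve the uniform model's total
    budget (num_layers * uniform_hidden) but concentrate it in a contiguous
    middle span of round(keep_frac * num_layers) layers."""
    total_budget = num_layers * uniform_hidden
    num_keep = max(1, round(num_layers * keep_frac))
    start = (num_layers - num_keep) // 2      # centre the kept span
    boosted = total_budget // num_keep        # enlarged hidden size per kept layer
    dims = [0] * num_layers                   # layers with FFN removed get 0
    for i in range(start, start + num_keep):
        dims[i] = boosted
    return dims

# e.g. a 12-layer model with uniform FFN hidden size 2048:
# 8 middle layers (indices 2-9) get hidden size 3072, the rest get none.
dims = ffn_dims_for(num_layers=12, uniform_hidden=2048)
```

Under this scheme the attention sublayers are untouched; only the FFN width varies by depth, so uniform and concentrated models are comparable at equal parameter count.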
Problem

Research questions and friction points this paper is trying to address.

Analyzing layerwise importance of FFNs in Transformer models
Investigating FFN distribution effects on pretraining efficiency
Optimizing FFN placement across model layers for performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Redistributes FFN capacity across layers while maintaining total parameters
Trains models from scratch to study pretraining importance
Concentrates FFNs in middle layers for optimal performance