Layerwise Importance Analysis of Feed-Forward Networks in Transformer-based Language Models

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the inter-layer importance distribution of feed-forward networks (FFNs) in Transformer language models during pretraining. We propose a layer-wise importance analysis framework that, under a fixed total parameter budget, reallocates FFN hidden dimensions across layers, enlarging FFNs in some layers while removing them entirely from others. Extensive ablation experiments are conducted via from-scratch pretraining on models spanning 285M to 1.2B parameters and 12–40 layers. Results demonstrate that FFN importance is non-uniformly distributed: concentrating FFN capacity within a contiguous middle 70% of layers consistently yields superior multi-task downstream performance compared to conventional uniform allocation. This finding challenges the standard practice of assigning equal FFN capacity to every layer and provides empirical grounding for designing more parameter-efficient Transformer architectures.

📝 Abstract
This study investigates the layerwise importance of feed-forward networks (FFNs) in Transformer-based language models during pretraining. We introduce an experimental approach that, while maintaining the total parameter count, increases the FFN dimensions in some layers and completely removes the FFNs from other layers. Furthermore, since our focus is on the importance of FFNs during pretraining, we train models from scratch to examine whether the importance of FFNs varies depending on their layer positions, rather than using publicly available pretrained models as is frequently done. Through comprehensive evaluations of models with varying sizes (285M, 570M, and 1.2B parameters) and layer counts (12, 24, and 40 layers), we demonstrate that concentrating FFNs in 70% of the consecutive middle layers consistently outperforms standard configurations for multiple downstream tasks.
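The budget-preserving reallocation described above can be sketched in a few lines. The snippet below is an illustrative sketch, not the authors' code: the function name `ffn_dims_for` and its parameters are hypothetical. It keeps the total FFN parameter budget of a uniform configuration fixed while concentrating all FFN hidden units into a contiguous middle span covering 70% of the layers, with the remaining layers receiving no FFN at all.

```python
# Hypothetical sketch of the paper's budget-preserving FFN reallocation.
# `ffn_dims_for` and its arguments are assumptions for illustration only.

def ffn_dims_for(num_layers: int, uniform_hidden: int, keep_frac: float = 0.7):
    """Per-layer FFN hidden sizes that preserve the uniform model's total
    budget (num_layers * uniform_hidden) but concentrate it in a contiguous
    middle span of round(keep_frac * num_layers) layers."""
    total_budget = num_layers * uniform_hidden
    num_keep = max(1, round(num_layers * keep_frac))
    start = (num_layers - num_keep) // 2      # centre the kept span
    boosted = total_budget // num_keep        # enlarged hidden size per kept layer
    dims = [0] * num_layers                   # layers with FFN removed get 0
    for i in range(start, start + num_keep):
        dims[i] = boosted
    return dims

# e.g. a 12-layer model with uniform FFN hidden size 2048:
# 8 middle layers (indices 2-9) get hidden size 3072, the rest get none.
dims = ffn_dims_for(num_layers=12, uniform_hidden=2048)
```

Under this scheme the attention sublayers are untouched; only the FFN width varies by depth, so uniform and concentrated models are comparable at equal parameter count.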
Problem

Research questions and friction points this paper is trying to address.

Analyzing layerwise importance of FFNs in Transformer models
Investigating FFN distribution effects on pretraining efficiency
Optimizing FFN placement across model layers for performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Redistributes FFN capacity across layers while maintaining total parameters
Trains models from scratch to study pretraining importance
Concentrates FFNs in middle layers for optimal performance