Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity

📅 2025-04-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating Transformer-based data compression is hindered by unknown target distributions and computationally expensive entropy estimation. Method: We propose a theoretically grounded information-theoretic framework for controlled experiments, integrating dynamic activation tracking, module-level sparsity quantification, and dead-neuron monitoring. Contribution/Results: We systematically show that Transformers possess an intrinsic low-entropy inductive bias that is amplified with scale and driven primarily by the feed-forward network (FFN) modules. Dynamic sparsity mechanisms, including residual skipping and neuron death, correlate strongly with training stability and the depth of redundancy reduction. Larger models exhibit significantly lower active-neuron ratios and stronger low-entropy compression tendencies. Critically, we identify abrupt neuron death as a key precursor to training collapse. These findings establish a theoretical foundation for understanding the fundamental compression mechanisms of Transformers.

📝 Abstract
Compression has been a critical lens for understanding the success of Transformers. In the past, the target distribution has typically served as the criterion for evaluating a model's compression performance. Nevertheless, it often remains challenging to precisely assess how well the model achieves compression and to compare the information content of the learned distribution with that of the target distribution, as the target distribution is typically unknown and entropy computation often incurs exponential cost. In this work, we explore these issues under a controlled experimental setup. We find that Transformers exhibit a unique inductive bias in data compression: beyond approaching the target distribution, they tend to favor learning lower-entropy distributions, and this tendency becomes more pronounced as model size increases. This preference prevents Transformers from perfectly aligning with the target distribution; instead, they further compress its information content. Furthermore, we show that the FFN module plays a critical role in driving this bias. In addition, while models remove informational redundancy from data during compression, they also exhibit redundancy within their parameters, which enables compression and can be characterized through dynamic sparsity. However, the dynamic sparsity patterns in Transformers, particularly in the attention and FFN modules, demand further exploration. To this end, we show that larger Transformers exhibit a stronger preference for bypassing attention computations via residual connections and have a lower proportion of active neurons. Interestingly, we also find that training instability in larger models correlates strongly with sudden increases in dead neurons. Our work contributes to a deeper understanding of Transformers through the lens of entropy and dynamic sparsity.

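The paper's exact measurement protocol is not reproduced on this page, but the two sparsity statistics it leans on, the active-neuron ratio and the dead-neuron count, are straightforward to track. A minimal sketch, assuming post-ReLU FFN activations collected over one batch (shapes and the `eps` threshold are illustrative assumptions, not the authors' code):

```python
import numpy as np

def sparsity_metrics(activations, eps=0.0):
    """Batch-level dynamic sparsity statistics for one FFN layer.

    activations: array of shape (num_tokens, num_neurons), the post-ReLU
    hidden activations of the layer on a batch of tokens.
    Returns (active_neuron_ratio, dead_neuron_count).
    """
    active_mask = activations > eps  # which neurons fire on which tokens
    # Fraction of (token, neuron) pairs that are active on this batch.
    active_ratio = float(active_mask.mean())
    # A neuron counts as "dead" here if it never fires for any token.
    dead_count = int((~active_mask.any(axis=0)).sum())
    return active_ratio, dead_count

# Toy batch: 2 tokens, 3 neurons; only the middle neuron ever fires.
acts = np.array([[0.0, 1.0, 0.0],
                 [0.0, 2.0, 0.0]])
ratio, dead = sparsity_metrics(acts)  # ratio = 1/3, dead = 2
```

Monitoring `dead` over training steps is the kind of signal the summary describes: an abrupt jump in the dead-neuron count preceding training collapse.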
Problem

Research questions and friction points this paper is trying to address.

Assessing Transformer compression performance with unknown target distributions
Exploring Transformers' bias toward low-entropy distributions during learning
Investigating dynamic sparsity patterns in attention and FFN modules
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformers favor learning lower-entropy distributions
FFN module drives low-entropy inductive bias
Dynamic sparsity enables parameter redundancy compression
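The low-entropy claim above can be made concrete with Shannon entropy: a model that sharpens probability mass beyond the target distribution produces a strictly lower-entropy distribution. A small illustration with hypothetical distributions (the specific numbers are invented for the example, not taken from the paper):

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in bits of a categorical distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log 0 = 0 by convention
    return float(-(p * np.log2(p)).sum())

target = [0.25, 0.25, 0.25, 0.25]   # uniform target: 2.0 bits
learned = [0.70, 0.15, 0.10, 0.05]  # sharper learned fit: ~1.32 bits

assert shannon_entropy(learned) < shannon_entropy(target)
```

In the paper's framing, the gap between these two entropies is the extra compression the Transformer applies beyond matching the target.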
Ruifeng Ren
Renmin University of China
Machine learning, LLMs
Yong Liu
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China