Finedeep: Mitigating Sparse Activation in Dense LLMs via Multi-Layer Fine-Grained Experts

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models suffer from sparse activation, in which many activation values tend toward zero, limiting how efficiently the model explores its representation space. To mitigate this, the authors propose Finedeep, a deep-layered fine-grained expert architecture for dense models: each feed-forward network layer is partitioned into small experts arranged across multiple sub-layers, and a novel routing mechanism determines each expert's contribution. Across model sizes, Finedeep outperforms traditional dense architectures on perplexity and downstream benchmarks while keeping parameter counts and FLOPs comparable. The best results are obtained by balancing depth against width, i.e., the number of expert sub-layers against the number of experts per sub-layer, and empirical analysis confirms that the approach alleviates sparse activation and makes fuller use of the model's representation capacity.

📝 Abstract
Large language models have demonstrated exceptional performance across a wide range of tasks. However, dense models usually suffer from sparse activation, where many activation values tend towards zero (i.e., are inactivated). We argue that this can restrict the efficient exploration of the model's representation space. To mitigate this issue, we propose Finedeep, a deep-layered fine-grained expert architecture for dense models. Our framework partitions the feed-forward network layers of traditional dense models into small experts and arranges them across multiple sub-layers. A novel routing mechanism is proposed to determine each expert's contribution. We conduct extensive experiments across various model sizes, demonstrating that our approach significantly outperforms traditional dense architectures in terms of perplexity and benchmark performance while maintaining a comparable number of parameters and floating-point operations. Moreover, we find that Finedeep achieves optimal results when balancing depth and width, specifically by adjusting the number of expert sub-layers and the number of experts per sub-layer. Empirical results confirm that Finedeep effectively alleviates sparse activation and efficiently utilizes representation capacity in dense models.
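As a rough illustration of the architecture the abstract describes (not the authors' code), a Finedeep-style block replaces one wide feed-forward network with several sub-layers of small experts, each weighted by a token-dependent router. All dimensions, the sigmoid gating, and the per-sub-layer residual connection below are assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ffn = 64, 256
n_sublayers, n_experts = 2, 4
d_expert = d_ffn // n_experts  # fine-grained experts keep total width comparable

def silu(z):
    return z / (1.0 + np.exp(-z))

def make_expert():
    # One small two-matrix FFN expert: d_model -> d_expert -> d_model.
    return (rng.normal(0, 0.02, (d_model, d_expert)),
            rng.normal(0, 0.02, (d_expert, d_model)))

sublayers = [[make_expert() for _ in range(n_experts)]
             for _ in range(n_sublayers)]
# One router per sub-layer, producing a weight per expert for each token.
routers = [rng.normal(0, 0.02, (d_model, n_experts))
           for _ in range(n_sublayers)]

def finedeep_block(x):
    # x: (tokens, d_model). Each sub-layer combines its experts' outputs
    # with soft, token-dependent gates, then adds a residual connection.
    for experts, W_router in zip(sublayers, routers):
        gates = 1.0 / (1.0 + np.exp(-(x @ W_router)))  # (tokens, n_experts)
        out = np.zeros_like(x)
        for i, (W_up, W_down) in enumerate(experts):
            out += gates[:, i:i + 1] * (silu(x @ W_up) @ W_down)
        x = x + out
    return x

x = rng.normal(size=(8, d_model))
y = finedeep_block(x)
print(y.shape)  # (8, 64)
```

With `d_expert = d_ffn // n_experts`, the experts of one sub-layer together hold roughly the parameters of the original dense FFN, which is how the sketch mirrors the paper's fixed-parameter comparison.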
Problem

Research questions and friction points this paper is trying to address.

Mitigates sparse activation in dense LLMs
Enhances model representation space exploration
Optimizes depth-width balance for expert sub-layers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-layer fine-grained expert architecture
Novel routing mechanism for experts
Balancing depth and width optimization
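To make the "sparse activation" problem above concrete, one can measure the fraction of FFN activations that are (near) zero after the nonlinearity. The toy setup below uses ReLU and Gaussian pre-activations purely for illustration; the threshold is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(z, 0.0)

# Toy FFN activations: after ReLU, a large share of values are exactly zero,
# i.e. those hidden dimensions contribute nothing to the representation.
pre_act = rng.normal(size=(1000, 512))
act = relu(pre_act)

dead_fraction = np.mean(np.abs(act) < 1e-6)
print(f"inactive activations: {dead_fraction:.1%}")  # ~50% for N(0,1) inputs
```

Finedeep's claim is that distributing many small experts across sub-layers, with soft routing, drives this inactive fraction down under the same parameter and FLOPs budget.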