Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling

📅 2026-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying large language models (LLMs) on edge devices faces the longstanding trade-off between model compression and accuracy: conventional pruning degrades performance, existing upcycling methods introduce redundancy, and training sparse Mixture-of-Experts (MoE) models from scratch incurs prohibitive costs. This work proposes Dense2MoE, a novel framework that unifies pruning and upcycling for the first time. Guided by hardware Roofline models, Dense2MoE employs Layer Fusion UpCycling (LF-UC) to prune redundant attention modules while repurposing their MLPs as MoE experts, coupled with selective token routing to control activated parameters. Requiring only lightweight continued pretraining, Dense2MoE simultaneously enhances inference efficiency and model accuracy without substantial training overhead. Experiments demonstrate that MoE models derived from various public dense LLMs via Dense2MoE consistently outperform dense baselines, state-of-the-art compression techniques, and standard upcycling approaches, establishing a new Pareto frontier in latency–accuracy trade-offs for on-device deployment.
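The Roofline reasoning mentioned in the summary can be illustrated with the standard Roofline bound: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. This is only a minimal sketch. The function names and device numbers are hypothetical and are not taken from the paper.

```python
def attainable_flops(peak_flops: float, mem_bandwidth: float,
                     arithmetic_intensity: float) -> float:
    """Standard Roofline bound in FLOP/s.

    peak_flops: peak compute throughput (FLOP/s)
    mem_bandwidth: peak memory bandwidth (bytes/s)
    arithmetic_intensity: FLOPs performed per byte moved (FLOP/byte)
    """
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def is_memory_bound(peak_flops: float, mem_bandwidth: float,
                    arithmetic_intensity: float) -> bool:
    """True when the kernel sits left of the ridge point (bandwidth-limited)."""
    ridge_point = peak_flops / mem_bandwidth
    return arithmetic_intensity < ridge_point


# Hypothetical edge device: 10 TFLOP/s peak compute, 50 GB/s memory bandwidth.
PEAK = 10e12
BANDWIDTH = 50e9

# A hypothetical kernel doing 2 FLOPs per byte moved is limited by bandwidth.
low_intensity = attainable_flops(PEAK, BANDWIDTH, 2.0)
```

With these assumed numbers the ridge point is 200 FLOP/byte, so any module with lower arithmetic intensity is limited by memory traffic rather than compute. That is the sense in which the summary describes attention modules as candidates for pruning.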
📝 Abstract
The Mixture-of-Experts (MoE) architecture is highly promising for resource-constrained on-device deployments, yet training these models from scratch incurs prohibitive costs. Current methods attempt to alleviate this by upcycling dense models into MoEs; however, they often introduce parameter redundancy that degrades inference efficiency. Alternatively, standard layer pruning mitigates redundancy but inevitably compromises model accuracy. To resolve this dilemma, we propose Dense2MoE, a novel framework that unifies pruning and upcycling through Layer Fusion UpCycling (LF-UC). Guided by hardware Roofline theory, Dense2MoE systematically overcomes the inference memory wall by pruning bandwidth-heavy attention modules from redundant layers while repurposing their Multi-Layer Perceptrons (MLPs) into MoE experts. This structural innovation preserves the model's core capabilities and strictly limits active parameters via selective token routing. With a modest continual pre-training budget, Dense2MoE efficiently converts publicly available dense LLMs into on-device-ready MoE models. Extensive experiments demonstrate that Dense2MoE significantly advances the Pareto frontier for on-device inference latency versus model accuracy, outperforming dense baselines, state-of-the-art compression, and standard upcycling methods.
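The structural idea in the abstract, dropping attention from redundant layers while keeping their MLPs as experts, can be sketched as parameter bookkeeping. This is an illustration under stated assumptions, not the paper's implementation. The abstract does not say how layers are grouped, which attention module survives, or how the router is built, so the class names, the choice to keep the first layer's attention, and the top-k routing rule here are all hypothetical. Router parameters are ignored.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DenseLayer:
    attn_params: int  # parameters in the attention module
    mlp_params: int   # parameters in the MLP


@dataclass
class FusedMoELayer:
    attn_params: int          # attention kept from one layer of the group
    expert_params: List[int]  # MLPs of the whole group, reused as experts
    top_k: int                # experts activated per token

    def total_params(self) -> int:
        return self.attn_params + sum(self.expert_params)

    def active_params(self) -> int:
        # Upper bound per token: the attention plus the k largest experts.
        largest = sorted(self.expert_params, reverse=True)[: self.top_k]
        return self.attn_params + sum(largest)


def fuse_layers(group: List[DenseLayer], top_k: int = 1) -> FusedMoELayer:
    """Assumption: keep the first layer's attention, prune the others,
    and turn every MLP in the group into an expert."""
    if not group:
        raise ValueError("group must contain at least one layer")
    if not 1 <= top_k <= len(group):
        raise ValueError("top_k must be between 1 and the number of experts")
    return FusedMoELayer(group[0].attn_params,
                         [layer.mlp_params for layer in group], top_k)


def route_top_k(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring experts for one token."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy example: two identical layers with 4 attention and 8 MLP parameters.
group = [DenseLayer(4, 8), DenseLayer(4, 8)]
fused = fuse_layers(group, top_k=1)
dense_total = sum(l.attn_params + l.mlp_params for l in group)  # 24
fused_total = fused.total_params()                              # 20
fused_active = fused.active_params()                            # 12
```

In this toy setting the dense stack touches all 24 parameters per token, while the fused layer stores 20 and activates at most 12. This shows how pruning attention and routing tokens to a subset of experts can both reduce the parameters read per token.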
Problem

Research questions and friction points this paper is trying to address.

on-device LLMs
Mixture of Experts
model pruning
parameter redundancy
inference efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Experts
Model Pruning
Upcycling
On-Device LLMs
Layer Fusion