Mixture-of-Depths Attention

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the degradation of shallow-layer information in deeply scaled large language models, where repeated residual updates dilute informative features and make them hard to recover in deeper layers. To mitigate this, the authors propose Mixture-of-Depths Attention (MoDA), which introduces cross-layer key-value fusion: each attention head jointly attends to key-value pairs from both the current layer and preceding layers. MoDA is paired with a post-normalization architecture to enhance representational capacity, and a memory-efficient algorithm compatible with FlashAttention-2 substantially reduces non-contiguous memory overhead. Experiments on a 1.5B-parameter model show that MoDA reduces average perplexity by 0.2 and improves downstream task performance by 2.11%, with only a 3.7% increase in FLOPs while retaining 97.3% of FlashAttention-2's computational efficiency, offering a new paradigm for deep model scaling.

📝 Abstract
Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it reduces average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA.
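The core idea in the abstract, each head attending to the current layer's sequence KV pairs fused with depth KV pairs carried over from a preceding layer, can be illustrated with a minimal single-head NumPy sketch. This is an assumption-laden toy (no causal masking, one preceding layer, no learned projections); the function name `moda_attention` and all shapes are illustrative, not the paper's actual implementation, which additionally provides a FlashAttention-2-compatible kernel.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moda_attention(q, k_cur, v_cur, k_prev, v_prev):
    """Toy single-head mixture-of-depths attention.

    q:               queries at the current layer, shape (T, d)
    k_cur, v_cur:    sequence KV pairs from the current layer, (T, d)
    k_prev, v_prev:  depth KV pairs from a preceding layer,   (T, d)

    The head attends over the fused KV set, so shallow-layer
    features remain directly reachable at deeper layers.
    """
    k = np.concatenate([k_cur, k_prev], axis=0)   # (2T, d)
    v = np.concatenate([v_cur, v_prev], axis=0)   # (2T, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (T, 2T)
    return softmax(scores, axis=-1) @ v           # (T, d)

# Toy usage with random features.
rng = np.random.default_rng(0)
T, d = 4, 8
q, kc, vc, kp, vp = (rng.standard_normal((T, d)) for _ in range(5))
out = moda_attention(q, kc, vc, kp, vp)
print(out.shape)  # → (4, 8)
```

Note the cost intuition: fusing one preceding layer doubles the KV length per head, but since attention scores are only a fraction of total transformer FLOPs, a small overall overhead (the paper reports 3.7%) is plausible.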
Problem

Research questions and friction points this paper is trying to address.

signal degradation
depth scaling
large language models
residual updates
feature dilution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Depths Attention
depth scaling
signal degradation
hardware-efficient attention
residual feature preservation
Lianghui Zhu
School of EIC, Huazhong University of Science & Technology; ByteDance Seed
Yuxin Fang
ByteDance Seed
Bencheng Liao
School of EIC, Huazhong University of Science & Technology; ByteDance Seed
Shijie Wang
ByteDance Seed
Tianheng Cheng
ByteDance Seed
Computer Vision, Object Detection, Instance Segmentation, Multimodal Models, Autonomous Driving
Zilong Huang
ByteDance Inc.
Multi-modal Learning, Computer Vision
Chen Chen
ByteDance Seed
Lai Wei
ByteDance Seed
Yutao Zeng
ByteDance Seed
Ya Wang
ByteDance Seed
Yi Lin
ByteDance Seed
Yu Li
ByteDance Seed
Xinggang Wang
Professor, Huazhong University of Science and Technology
Artificial Intelligence, Computer Vision, Autonomous Driving, Object Detection, Object Segmentation