UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high memory-access overhead of MoE model inference, and the failure of prior memory-augmented architectures such as UltraMem to match state-of-the-art 8-expert MoE configurations, this paper proposes UltraMemV2, a redesigned memory-augmented architecture. Its core innovations are: embedding lightweight memory modules in every Transformer block, adopting FFN-based value processing, simplifying value expansion to a single linear projection, using principled parameter initialization, and rebalancing the computation allocation between memory modules and FFNs. This design lets a 120B-parameter model activate only 2.5B parameters per token, substantially reducing memory access. Experiments show that UltraMemV2 matches the performance of an 8-expert MoE model under equivalent computational and parameter budgets, while improving long-context memorization, multi-round memorization, and in-context learning by 1.6, 6.2, and 7.9 points, respectively.
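The headline sparsity figure can be checked with simple arithmetic: activating 2.5B of 120B total parameters gives an activation density of roughly 2%, which is the quantity the paper argues matters more than total sparse parameter count. A quick check:

```python
# Reported figures from the summary: 120B total parameters, 2.5B activated per token.
total_params = 120e9
active_params = 2.5e9

density = active_params / total_params
print(f"activation density = {density:.2%}")  # activation density = 2.08%
```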

📝 Abstract
While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very few memory accesses, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameter budgets but with significantly lower memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models up to 2.5B activated parameters from 120B total parameters, and establish that activation density has greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.
Problem

Research questions and friction points this paper is trying to address.

Reducing high memory access costs in Mixture of Experts models
Closing performance gap between memory-layer and 8-expert MoE models
Achieving efficient sparse computation with superior long-context learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates memory layers into every transformer block
Simplifies value expansion with single linear projections
Adopts FFN-based value processing from PEER
Implements principled parameter initialization
Rebalances memory-to-FFN computation ratios
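The first three innovations can be sketched together: a product-key memory (two small sub-key tables whose Cartesian product indexes the full slot space) scores and selects top-k slots per token, and each selected slot applies a tiny PEER-style rank-1 FFN rather than returning a single embedding vector. The sketch below is a minimal NumPy illustration under assumed shapes; all dimensions, names, and the rank-1 expert form are illustrative, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d, half = 32, 16          # model dim, sub-key dim (d = 2 * half)
n_sub = 8                 # sub-keys per half -> n_sub**2 = 64 memory slots
top_k = 4                 # slots activated per token

# Two small sub-key tables; their Cartesian product indexes n_sub**2 slots,
# so scoring costs O(n_sub) per half instead of O(n_sub**2).
keys_a = rng.standard_normal((n_sub, half))
keys_b = rng.standard_normal((n_sub, half))

# PEER-style values: each slot owns a tiny rank-1 FFN (a down- and an
# up-projection) instead of a single value embedding (hypothetical shapes).
v_down = rng.standard_normal((n_sub * n_sub, d))   # slot: input -> scalar hidden
v_up = rng.standard_normal((n_sub * n_sub, d))     # slot: scalar hidden -> output

def memory_layer(x):
    """Product-key lookup followed by FFN-based value processing (sketch)."""
    qa, qb = x[:half], x[half:]
    sa, sb = keys_a @ qa, keys_b @ qb              # scores for each sub-key half
    scores = sa[:, None] + sb[None, :]             # combined score per slot
    flat = scores.ravel()
    idx = np.argpartition(flat, -top_k)[-top_k:]   # indices of top-k slots
    w = np.exp(flat[idx] - flat[idx].max())
    w /= w.sum()                                   # softmax over selected slots
    # Each selected slot applies its tiny FFN: up_i * gelu(down_i . x)
    h = v_down[idx] @ x
    act = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return (w[:, None] * act[:, None] * v_up[idx]).sum(axis=0)

out = memory_layer(rng.standard_normal(d))
print(out.shape)  # (32,)
```

Only `top_k` of the 64 slot FFNs are ever touched per token, which is the source of the low memory access the paper emphasizes; scaling `n_sub` grows total parameters quadratically while the per-token cost stays fixed.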
👥 Authors
Zihao Huang
ByteDance Seed
Yu Bao
ByteDance Seed
Qiyang Min
ByteDance
Siyan Chen
ByteDance Seed
Ran Guo
ByteDance Seed
Hongzhi Huang
ByteDance Seed
Defa Zhu
ByteDance
Yutao Zeng
ByteDance Seed
Banggu Wu
ByteDance
Xun Zhou
Professor of Computer Science, Harbin Institute of Technology, Shenzhen (HIT-SZ)
Siyuan Qiao
ByteDance Seed