🤖 AI Summary
This work proposes MoVE, a novel mechanism that decouples parametric memory from network architecture by introducing it as an independent, scalable dimension. Unlike conventional autoregressive models—where memory capacity is inherently tied to depth or width, necessitating increased computational cost for knowledge expansion—MoVE employs a shared, global learnable value embedding bank dynamically fused with retrieved content via soft gating. This design enables flexible scaling of memory capacity without significantly increasing computational overhead. Empirical evaluations demonstrate that MoVE consistently outperforms both standard and layer-wise memory baselines across text and image generation tasks, achieving lower perplexity and higher generation fidelity under identical computational budgets.
📝 Abstract
Autoregressive sequence modeling stands as the cornerstone of modern Generative AI, powering results across diverse modalities ranging from text generation to image generation. However, a fundamental limitation of this paradigm is the rigid structural coupling of model capacity to computational cost: expanding a model's parametric memory -- its repository of factual knowledge or visual patterns -- traditionally requires deepening or widening the network, which incurs a proportional rise in active FLOPs. In this work, we introduce $\textbf{MoVE (Mixture of Value Embeddings)}$, a mechanism that breaks this coupling and establishes a new axis for scaling capacity. MoVE decouples memory from compute by introducing a global bank of learnable value embeddings shared across all attention layers. For every step in the sequence, the model employs a differentiable soft gating mechanism to dynamically mix retrieved concepts from this bank into the standard value projection. This architecture allows parametric memory to be scaled independently of network depth by simply increasing the number of embedding slots. We validate MoVE through strictly controlled experiments on two representative applications of autoregressive modeling: Text Generation and Image Generation. In both domains, MoVE yields consistent performance improvements over standard and layer-wise memory baselines, enabling the construction of "memory-dense" models that achieve lower perplexity and higher fidelity than their dense counterparts at comparable compute budgets.
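To make the mechanism concrete, here is a minimal NumPy sketch of the value-fusion step as described in the abstract: a shared bank of learnable value embeddings is addressed by a differentiable soft gate and mixed into the standard value projection. The specific parameterization (separate addressing keys `mem_keys`, additive fusion, per-token gating) is an assumption for illustration, not necessarily the paper's exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def move_values(h, W_v, mem_keys, mem_values):
    """Augment the standard value projection with a shared memory bank.

    h:          (seq_len, d_model) hidden states entering one attention layer
    W_v:        (d_model, d_head)  standard value projection
    mem_keys:   (n_slots, d_model) addressing keys for the bank (assumed)
    mem_values: (n_slots, d_head)  learnable value embeddings, shared
                across all attention layers in MoVE

    Capacity scales with n_slots, while per-token compute grows only by
    the gating matmuls -- independent of network depth or width.
    """
    v = h @ W_v                               # standard values
    gates = softmax(h @ mem_keys.T, axis=-1)  # soft gate over memory slots
    retrieved = gates @ mem_values            # mixture of value embeddings
    return v + retrieved                      # fused values fed to attention

rng = np.random.default_rng(0)
d_model, d_head, n_slots, seq_len = 16, 8, 32, 4
h = rng.standard_normal((seq_len, d_model))
out = move_values(
    h,
    rng.standard_normal((d_model, d_head)) / np.sqrt(d_model),
    rng.standard_normal((n_slots, d_model)),
    rng.standard_normal((n_slots, d_head)),
)
print(out.shape)  # (4, 8)
```

Note that `n_slots` can be increased freely to grow parametric memory: only the bank's parameter count changes, while the shape of the fused values (and hence the rest of the attention computation) is unchanged.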