🤖 AI Summary
How can neural network representational capacity and optimization efficiency be improved without the quadratic computational overhead of increasing hidden-layer width? This paper proposes Virtual Width Networks (VWN), an approach that decouples *representational width* from *backbone width*: instead of enlarging hidden layers, VWN expands the embedding space while keeping backbone computation nearly constant. Empirically, the authors establish an approximately log-linear scaling relation between virtual width and training-loss reduction, introducing a new dimension for large-model efficiency. In large-scale experiments, an 8× virtual-width expansion accelerates optimization by over 2× for next-token prediction and over 3× for next-2-token prediction, with gains that grow over the course of training.
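"Log-linear" here means the loss reduction grows roughly linearly in the logarithm of the virtual-width expansion factor, i.e. each doubling of virtual width buys a roughly constant extra reduction. A minimal illustration of what such a fit looks like; the slope `a` is a made-up placeholder, not a value reported in the paper:

```python
import math

def loss_reduction(k, a=0.01):
    """Hypothetical log-linear fit: training-loss reduction as a function
    of the virtual-width expansion factor k. The slope a is illustrative
    only, not the paper's fitted value."""
    return a * math.log2(k)

# Each doubling of virtual width adds a roughly constant reduction.
for k in (1, 2, 4, 8):
    print(k, round(loss_reduction(k), 3))
```

Under this form, going from 1× to 8× (three doublings) yields three equal increments of loss reduction, which is what the paper's scaling plots would show as a straight line on a log-x axis.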
📝 Abstract
We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8-times expansion accelerates optimization by over 2 times for next-token and 3 times for next-2-token prediction. The advantage amplifies over training as both the loss gap grows and the convergence-speedup ratio increases, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.
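A back-of-envelope sketch of why the cost stays nearly constant (this is an illustrative model, not the paper's implementation): backbone blocks keep their original hidden size, and only a thin pair of linear projections bridges the wide embedding and the backbone. The layer-cost constant `c = 12`, the width `d`, and the depth `L` below are assumed round numbers for illustration:

```python
def layer_flops(d, c=12):
    """Rough per-token mult-adds of one transformer block of width d
    (attention + MLP combined; c = 12 is a common back-of-envelope constant)."""
    return c * d * d

d = 1024   # backbone hidden size (assumed, for illustration)
L = 32     # number of backbone layers (assumed)
k = 8      # virtual-width expansion factor, matching the 8x experiment

baseline = L * layer_flops(d)

# Naive widening: every block runs at width k*d, so cost grows quadratically in k.
naive_wide = L * layer_flops(k * d)

# Virtual width (sketch): blocks stay at width d; only one down-projection
# (k*d -> d) and one up-projection (d -> k*d) touch the wide embedding.
vwn = L * layer_flops(d) + 2 * (k * d) * d

print(f"naive widening: {naive_wide / baseline:.1f}x compute")  # 64.0x
print(f"virtual width:  {vwn / baseline:.3f}x compute")         # 1.042x
```

The quadratic term never touches the wide representation, so the overhead of an 8× virtual width is a few percent in this toy model rather than the 64× cost of actually widening every hidden layer.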