🤖 AI Summary
How can neural network representational capacity and optimization efficiency be improved without the quadratic computational overhead of increasing hidden-layer width? This paper proposes Virtual Width Networks (VWN), an approach that decouples *representational width* from *backbone width*: instead of enlarging hidden layers, VWN expands the embedding space while keeping backbone computation nearly constant. Empirically, the authors establish an approximately log-linear scaling relation between virtual width and training-loss reduction, introducing a new dimension for large-model efficiency. In large-scale experiments, an 8× virtual-width expansion accelerates optimization by over 2× for next-token prediction and over 3× for next-2-token prediction, with gains that grow over the course of training.
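"Log-linear" here means the loss reduction grows roughly linearly in the logarithm of the virtual-width expansion factor, i.e. each doubling of virtual width buys a roughly constant extra reduction. A minimal illustration of what such a fit looks like; the slope `a` is a made-up placeholder, not a value reported in the paper:

```python
import math

def loss_reduction(k, a=0.01):
    """Hypothetical log-linear fit: training-loss reduction as a function
    of the virtual-width expansion factor k. The slope a is illustrative
    only, not the paper's fitted value."""
    return a * math.log2(k)

# Each doubling of virtual width adds a roughly constant reduction.
for k in (1, 2, 4, 8):
    print(k, round(loss_reduction(k), 3))
```

Under this form, going from 1× to 8× (three doublings) yields three equal increments of loss reduction, which is what the paper's scaling plots would show as a straight line on a log-x axis.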
📝 Abstract
We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8-times expansion accelerates optimization by over 2 times for next-token and 3 times for next-2-token prediction. The advantage amplifies over training as both the loss gap grows and the convergence-speedup ratio increases, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.
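A back-of-envelope sketch of why the cost stays nearly constant (this is an illustrative model, not the paper's implementation): backbone blocks keep their original hidden size, and only a thin pair of linear projections bridges the wide embedding and the backbone. The layer-cost constant `c = 12`, the width `d`, and the depth `L` below are assumed round numbers for illustration:

```python
def layer_flops(d, c=12):
    """Rough per-token mult-adds of one transformer block of width d
    (attention + MLP combined; c = 12 is a common back-of-envelope constant)."""
    return c * d * d

d = 1024   # backbone hidden size (assumed, for illustration)
L = 32     # number of backbone layers (assumed)
k = 8      # virtual-width expansion factor, matching the 8x experiment

baseline = L * layer_flops(d)

# Naive widening: every block runs at width k*d, so cost grows quadratically in k.
naive_wide = L * layer_flops(k * d)

# Virtual width (sketch): blocks stay at width d; only one down-projection
# (k*d -> d) and one up-projection (d -> k*d) touch the wide embedding.
vwn = L * layer_flops(d) + 2 * (k * d) * d

print(f"naive widening: {naive_wide / baseline:.1f}x compute")  # 64.0x
print(f"virtual width:  {vwn / baseline:.3f}x compute")         # 1.042x
```

The quadratic term never touches the wide representation, so the overhead of an 8× virtual width is a few percent in this toy model rather than the 64× cost of actually widening every hidden layer.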