🤖 AI Summary
Conventional neural network layers use fixed widths, requiring manual hyper-parameter tuning that adapts poorly to varying task requirements and data complexity; for large models, width optimization also incurs prohibitive computational costs. Method: we propose an end-to-end learnable, unbounded-width neural layer that jointly optimizes width and weights through ordinary differentiable training. The approach introduces a soft neuron importance ranking that enables zero-cost dynamic pruning and expansion, and combines gradient-driven joint training, importance-ranking regularization, and task-adaptive width control. Contribution/Results: evaluated across tabular, image, text, and graph domains, the method automatically adjusts layer width to match task difficulty. It eliminates manual width tuning while achieving lossless compression, maintaining the original accuracy despite significant width reduction, and thus jointly optimizes model performance and computational efficiency.
📝 Abstract
For almost 70 years, researchers have mostly relied on hyper-parameter tuning to pick the width of neural networks' layers out of many possible choices. This paper challenges the status quo by introducing an easy-to-use technique to learn an unbounded width of a neural network's layer during training. The technique relies on neither alternate optimization nor hand-crafted gradient heuristics; rather, it jointly optimizes the width and the parameters of each layer via simple backpropagation. We apply the technique to a broad range of data domains such as tables, images, texts, and graphs, showing how the width adapts to the task's difficulty. By imposing a soft ordering of importance among neurons, the trained network can be truncated at virtually zero cost, yielding a smooth, structured trade-off between performance and compute resources. Alternatively, one can dynamically compress the network with no performance degradation. For recent foundation models trained on large datasets, which are believed to require billions of parameters and whose huge training costs make hyper-parameter tuning infeasible, our approach stands as a viable alternative for width learning.
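The soft importance ordering can be illustrated with a minimal NumPy sketch. This is not the paper's actual formulation: the sigmoid gate, the single scalar `width` parameter, and all names here are hypothetical. The idea it demonstrates is that a learnable width induces decaying per-neuron gates, so after training the low-gate neurons can be dropped with essentially no change in the layer's output.

```python
import numpy as np

rng = np.random.default_rng(0)

def gates(width, max_neurons):
    # Hypothetical soft gate: sigmoid(width - j) decays with neuron index j,
    # imposing a soft ordering of importance among neurons.
    j = np.arange(max_neurons)
    return 1.0 / (1.0 + np.exp(j - width))

max_neurons = 64
W = rng.normal(size=(max_neurons, 8)) * 0.1   # hidden-layer weights (8 inputs)
V = rng.normal(size=(4, max_neurons)) * 0.1   # next-layer weights (4 outputs)
x = rng.normal(size=8)

# In the real method, `width` would be learned jointly with W and V
# via backpropagation; here it is fixed for illustration.
width = 10.0
g = gates(width, max_neurons)

# Full forward pass through the gated layer.
h_full = V @ (g * np.tanh(W @ x))

# "Zero-cost" truncation: simply drop neurons whose gate is negligible.
keep = g > 1e-6
h_trunc = V[:, keep] @ (g[keep] * np.tanh(W[keep] @ x))

print(int(keep.sum()), "of", max_neurons, "neurons kept")
print(np.allclose(h_full, h_trunc, atol=1e-5))  # → True
```

Because the gates decay monotonically with neuron index, truncation keeps a contiguous prefix of neurons, which is what makes the compression structured rather than an unstructured sparsity pattern.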