🤖 AI Summary
This work addresses the high deployment cost of large language models and their limited adaptability to varying computational budgets without retraining. To this end, the authors propose an importance-ordered nested low-rank decomposition mechanism that extracts low-rank components from a pretrained model in descending order of importance, thereby constructing a scalable family of submodels. This approach enables dynamic activation of an appropriate submodel based on available computational resources, achieving “train once, deploy anywhere” while continuously trading off performance against cost without any retraining. Experimental results demonstrate that the method substantially reduces deployment overhead while maintaining accuracy close to that of the original model across a wide range of computational budgets.
📝 Abstract
The growing scale of deep neural networks, encompassing large language models (LLMs) and vision transformers (ViTs), has made training from scratch prohibitively expensive and deployment increasingly costly. These models are often used as computational monoliths with fixed cost, a rigidity that fails to exploit their overparametrized architectures and largely hinders adaptive deployment across different cost budgets. We argue that importance-ordered nested components can be extracted from pretrained models and selectively activated based on the available computational budget. To this end, our proposed FlexRank method leverages low-rank weight decomposition with nested, importance-based consolidation to extract submodels of increasing capability. Our approach enables a "train-once, deploy-everywhere" paradigm that offers a graceful trade-off between cost and performance without training from scratch for each budget, advancing the practical deployment of large models.
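To make the core idea concrete, here is a minimal sketch (not the authors' implementation, and simplified to a single weight matrix): an SVD yields low-rank components already ordered by importance (singular values), so rank-`r` truncations form a nested family of submodels whose compute and memory scale with `r`, and a larger budget simply activates a longer prefix of the same components.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))  # stand-in for a pretrained weight matrix

# SVD: singular values s are sorted in descending order, so the
# components are naturally importance-ordered and nested.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

def submodel(rank: int) -> np.ndarray:
    """Weight of the rank-`rank` submodel: the nested prefix of the
    most important low-rank components (no retraining involved)."""
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

x = rng.standard_normal(256)
full = W @ x
# Larger budget -> larger rank -> output closer to the full model.
for r in (16, 64, 256):
    err = np.linalg.norm(submodel(r) @ x - full) / np.linalg.norm(full)
    print(f"rank {r:3d}: relative error {err:.3f}")
```

Because each submodel is a prefix of the next, a single stored decomposition serves every budget; selecting a deployment point is just choosing how many components to activate.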