TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

📅 2024-10-30
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Scaling Transformer models is prohibitively expensive because their linear projection layers have a fixed number of parameters; architectural modifications therefore necessitate full retraining. Method: We propose TokenFormer, the first architecture to introduce *parameter tokenization*: it represents model parameters as learnable tokens and replaces all linear layers with token-parameter attention, unifying parameter and input token representations in a shared latent space. Contribution/Results: Our method enables zero-shot, progressive parameter expansion without retraining, overcoming classical scaling bottlenecks. Without altering network topology, we scale model parameters from 124M to 1.4B while matching the performance of fully trained baselines, achieving substantial training cost reduction. The code and models are publicly released.

📝 Abstract
Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises primarily from their dependence on a fixed number of parameters within linear projections. When architectural modifications (e.g., channel dimensions) are introduced, the entire model typically requires retraining from scratch. As model sizes continue growing, this strategy results in increasingly high computational costs and becomes unsustainable. To overcome this problem, we introduce TokenFormer, a natively scalable architecture that leverages the attention mechanism not only for computations among input tokens but also for interactions between tokens and model parameters, thereby enhancing architectural flexibility. By treating model parameters as tokens, we replace all the linear projections in Transformers with our token-parameter attention layer, where input tokens act as queries and model parameters as keys and values. This reformulation allows for progressive and efficient scaling without necessitating retraining from scratch. Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs, achieving performance comparable to Transformers trained from scratch while greatly reducing training costs. Code and models are available at https://github.com/Haiyang-W/TokenFormer.
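The core reformulation in the abstract — input tokens acting as queries against learnable parameter key-value tokens in place of a fixed linear projection — can be sketched as follows. This is a minimal illustration assuming standard scaled dot-product attention with softmax; the paper's actual normalization may differ, and the function and variable names (`token_param_attention`, `K_p`, `V_p`) are chosen here for clarity, not taken from the released code.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_param_attention(X, K_p, V_p):
    """Replace a linear projection with attention over parameter tokens.

    X   : (n, d) input tokens, used as queries.
    K_p : (m, d) learnable key parameter tokens.
    V_p : (m, d) learnable value parameter tokens.
    Returns (n, d): each input token is a softmax-weighted mixture
    of the value parameter tokens.
    """
    scores = X @ K_p.T / np.sqrt(X.shape[-1])  # (n, m) token-parameter scores
    return softmax(scores) @ V_p               # (n, d) output tokens

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 8))
K_p = rng.standard_normal((4, 8))
V_p = rng.standard_normal((4, 8))
Y = token_param_attention(X, K_p, V_p)
```

Note that the output dimensionality is set by `V_p`, not by a weight matrix's shape, which is what makes the parameter count adjustable along the `m` axis.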
Problem

Research questions and friction points this paper is trying to address.

High computational cost of scaling Transformer models
Dependence on fixed parameters requiring full retraining
Lack of efficient progressive scaling for large models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tokenized model parameters for scalable architecture
Token-parameter attention replaces linear projections
Progressive scaling without retraining from scratch
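Progressive scaling in this scheme amounts to appending new key-value parameter tokens while keeping the trained ones intact. A minimal sketch, assuming zero initialization for the new rows so the expanded layer starts close to the old one (the paper's exact initialization may differ; `expand_params` is a hypothetical helper name):

```python
import numpy as np

def expand_params(K_p, V_p, m_new):
    """Grow a token-parameter layer by appending m_new key-value pairs.

    Existing rows are preserved, so prior training is reused;
    zero init for the new rows is an assumption of this sketch.
    """
    d = K_p.shape[1]
    K_ext = np.vstack([K_p, np.zeros((m_new, d))])  # (m + m_new, d)
    V_ext = np.vstack([V_p, np.zeros((m_new, d))])  # (m + m_new, d)
    return K_ext, V_ext

rng = np.random.default_rng(1)
K_p = rng.standard_normal((4, 8))
V_p = rng.standard_normal((4, 8))
K_big, V_big = expand_params(K_p, V_p, m_new=2)
```

Because only the parameter-token axis grows, the network topology and input/output shapes are unchanged, which is why no retraining from scratch is needed.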
Haiyang Wang
Max Planck Institute for Informatics, Peking University
Yue Fan
Max Planck Institute for Informatics
Muhammad Ferjad Naeem
Research Scientist, Google
Artificial Intelligence, Computer Vision, Machine Learning, Deep Learning
Yongqin Xian
Google
Computer Vision, Machine Learning
J. E. Lenssen
Max Planck Institute for Informatics
Liwei Wang
Peking University
Federico Tombari
Google, TU Munich
Computer Vision, Machine Learning, 3D Perception
B. Schiele
Max Planck Institute for Informatics