Scaling Probabilistic Transformer via Efficient Cross-Scale Hyperparameter Transfer

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Probabilistic Transformers are highly sensitive to hyperparameter choices, which makes them hard to scale efficiently. This work introduces Maximal Update Parametrization (muP) into the architecture for the first time: by rescaling parameters appropriately, hyperparameters tuned on small models can be transferred directly to larger ones without re-tuning. Within a Masked Language Modeling framework, the approach scales the architecture to 0.4 billion parameters and consistently outperforms standard Transformers at equivalent model sizes.

📝 Abstract

Probabilistic Transformer (PT), a white-box probabilistic model for contextual word representation, has demonstrated substantial similarity to standard Transformers in both computational structure and downstream task performance on small models and small- to medium-sized datasets. However, PT is less robust to hyperparameter choices than standard Transformers, making it harder to scale efficiently. In this work, we follow Maximal Update Parametrization (muP) to rescale PT's parameters, so that hyperparameters optimized on small models can be transferred to larger models without additional tuning. With this approach, we successfully scale PT to models with up to 0.4B parameters. Experiments show that PT consistently outperforms the standard Transformer under the same parameter budget on Masked Language Modeling (MLM) tasks. We hope this work will contribute to the practical deployment of probabilistic models at substantially larger scales in the future.
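
To make the idea of width-dependent rescaling concrete, here is a minimal sketch of the kind of rule muP is commonly described as using for ordinary dense layers trained with Adam. It is not the parametrization derived for PT in the paper, which the abstract does not spell out. The function names, the `kind` labels, and the specific scaling rules (hidden-weight learning rate proportional to 1/width, hidden-weight init standard deviation proportional to 1/sqrt(width)) are illustrative assumptions.

```python
import math


def mup_adam_lr(base_lr: float, base_width: int, width: int, kind: str) -> float:
    """Hypothetical helper: learning rate for one parameter group.

    Assumed simplified rule (not taken from the paper):
      - "hidden": matrix-like weights whose fan-in grows with width;
                  the learning rate shrinks in proportion to 1/width.
      - "vector": biases and other parameters with no growing fan-in;
                  the learning rate is left unchanged.
    """
    if kind == "hidden":
        return base_lr * base_width / width
    if kind == "vector":
        return base_lr
    raise ValueError(f"unknown parameter kind: {kind!r}")


def mup_hidden_init_std(base_std: float, base_width: int, width: int) -> float:
    """Hypothetical helper: init std for hidden weights, scaled as 1/sqrt(width)."""
    return base_std * math.sqrt(base_width / width)


# Hyperparameters tuned once on a small proxy model (illustrative values).
tuned_lr, tuned_std, proxy_width = 1e-3, 0.02, 256

# Reuse them at a 4x wider model without re-tuning.
target_width = 1024
hidden_lr = mup_adam_lr(tuned_lr, proxy_width, target_width, "hidden")
vector_lr = mup_adam_lr(tuned_lr, proxy_width, target_width, "vector")
hidden_std = mup_hidden_init_std(tuned_std, proxy_width, target_width)
```

The point of the sketch is only that the tuned values stay fixed while a deterministic, width-dependent factor is applied per parameter group; the factors appropriate for PT's parameters are the subject of the paper itself.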

Problem

Research questions and friction points this paper is trying to address.

Probabilistic Transformer

hyperparameter transfer

model scaling

robustness

large-scale models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Probabilistic Transformer

Maximal Update Parametrization

hyperparameter transfer