AI Summary
Sampling-based Bayesian learning is difficult to deploy efficiently in risk-sensitive scenarios because of its substantial memory and computational demands. This work proposes a novel sampling-parallel strategy that, for the first time, uses the parameter-sample dimension as the primary axis for multi-GPU parallelization. Without altering the model architecture or hyperparameters, the approach relieves memory pressure and accelerates training. It integrates seamlessly with data parallelism to form a hybrid parallel framework and remains fully compatible with existing Bayesian neural network training pipelines. Experiments show near-linear scaling efficiency when the number of samples grows with the available GPUs, fewer epochs to convergence at unchanged model accuracy, and marked improvements in both training speed and resource utilization.
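To make the core idea concrete, the snippet below is a minimal sketch of how the sample dimension could be sharded across GPUs; it is an illustration under stated assumptions, not the authors' implementation. It assumes a PyTorch model whose forward pass draws a fresh parameter sample on every call (e.g., a variational BNN) and a `torch.distributed` process group already initialized with one process per GPU; the function name `sampling_parallel_predict` is a hypothetical placeholder.

```python
import torch
import torch.distributed as dist


def sampling_parallel_predict(model, batch, total_samples):
    """Illustrative sketch: evaluate `total_samples` parameter samples,
    split evenly across ranks, and return the Monte Carlo mean prediction."""
    world_size = dist.get_world_size()
    # Each rank evaluates only its own shard of the parameter samples.
    local_samples = total_samples // world_size

    # One stochastic forward pass corresponds to one parameter sample.
    preds = [model(batch) for _ in range(local_samples)]
    local_sum = torch.stack(preds).sum(dim=0)

    # Combine the per-sample predictions from all GPUs with a single all-reduce.
    dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
    return local_sum / (local_samples * world_size)
```

In this reading, the per-sample forward passes are embarrassingly parallel, and the only communication is the final reduction over predictions, which is consistent with the near-linear scaling reported when the sample count grows with the number of GPUs.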
Abstract
Machine learning models, and deep neural networks in particular, are increasingly deployed in risk-sensitive domains such as healthcare, environmental forecasting, and finance, where reliable quantification of predictive uncertainty is essential. However, many uncertainty quantification (UQ) methods remain difficult to apply due to their substantial computational cost. Sampling-based Bayesian learning approaches, such as Bayesian neural networks (BNNs), are particularly expensive, since drawing and evaluating multiple parameter samples rapidly exhausts memory and compute resources. These constraints have so far limited the accessibility and exploration of Bayesian techniques. To address these challenges, we introduce sampling parallelism, a simple yet powerful parallelization strategy that targets the primary bottleneck of sampling-based Bayesian learning: the samples themselves. By distributing sample evaluations across multiple GPUs, our method reduces memory pressure and training time without requiring architectural changes or extensive hyperparameter tuning. We detail the methodology and evaluate its performance on several example tasks and architectures, comparing against distributed data parallelism (DDP) as a baseline. We further demonstrate that sampling parallelism is complementary to existing strategies by implementing a hybrid approach that combines sample and data parallelism. Our experiments show near-perfect scaling when the number of samples is scaled in proportion to the computational resources, confirming that sample evaluations parallelize cleanly. Although DDP achieves better raw speedups under strong scaling with a constant workload, sampling parallelism offers a notable advantage: by applying independent stochastic augmentations to the same batch on each GPU, it increases augmentation diversity and thereby reduces the number of epochs required for convergence.
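The sketch below illustrates how a hybrid training step with independent per-GPU augmentation might look; it is an assumption-laden outline rather than the paper's code. Every rank receives the same batch but applies its own stochastic augmentation, averages the loss over its local parameter samples, and then gradients are all-reduced so the update reflects all samples and augmentations. The helpers `augment` and `elbo_loss` are hypothetical placeholders, and a `torch.distributed` process group with one process per GPU is assumed.

```python
import torch
import torch.distributed as dist


def hybrid_train_step(model, optimizer, batch, targets,
                      augment, elbo_loss, local_samples):
    """Illustrative sketch of one sampling-parallel training step."""
    # Same batch on every rank, but an independent stochastic augmentation
    # per GPU, which increases the effective augmentation diversity.
    x = augment(batch)

    optimizer.zero_grad()
    # Average the loss over this rank's shard of the parameter samples;
    # each stochastic forward pass draws one sample.
    loss = torch.stack(
        [elbo_loss(model(x), targets) for _ in range(local_samples)]
    ).mean()
    loss.backward()

    # Average gradients across ranks so the update uses every rank's
    # samples and augmentations (manual equivalent of DDP's reduction).
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

    optimizer.step()
    return loss.item()
```

Under this reading, the gradient reduction is the same collective that DDP would perform, so the per-GPU augmentation diversity comes essentially for free; only the placement of batches and samples differs between the pure data-parallel and the sampling-parallel configurations.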