Stochastic activations

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses an optimization problem with ReLU activations in LLM feed-forward layers: the zero gradient for negative inputs blocks gradient flow and hinders training. The authors propose a Bernoulli stochastic activation mechanism: during pretraining, the feed-forward layer switches between SiLU and ReLU according to a probabilistic draw; for fine-tuning and inference, the activation is fixed to ReLU. This balances training flexibility against deployment efficiency, mitigating gradient blocking and improving convergence while preserving sparse hidden states and low FLOP counts, which translates into a significant CPU inference speedup. The scheme also offers a controlled way to increase output diversity in generation, performing only slightly below the best deterministic alternative (SiLU with temperature scaling) and clearly outperforming a ReLU-only model trained from scratch. The core contribution is a lightweight stochastic nonlinearity for LLM feed-forward layers that enables joint training-inference optimization without adding inference overhead.

📝 Abstract
We introduce stochastic activations. This novel strategy randomly selects between several non-linear functions in the feed-forward layer of a large language model. In particular, we choose between SILU or RELU depending on a Bernoulli draw. This strategy circumvents the optimization problem associated with RELU, namely, the constant shape for negative inputs that prevents the gradient flow. We leverage this strategy in two ways: (1) We use stochastic activations during pre-training and fine-tune the model with RELU, which is used at inference time to provide sparse latent vectors. This reduces the inference FLOPs and translates into a significant speedup in the CPU. Interestingly, this leads to much better results than training from scratch with the RELU activation function. (2) We evaluate stochastic activations for generation. This strategy performs reasonably well: it is only slightly inferior to the best deterministic non-linearity, namely SILU combined with temperature scaling. This offers an alternative to existing strategies by providing a controlled way to increase the diversity of the generated text.
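The mechanism described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the granularity of the Bernoulli draw (per layer, per token, or per step) is not specified here, so this version draws once per call, and the parameter name `p_relu` is an assumption.

```python
import math
import random

def silu(x: float) -> float:
    # SiLU (swish): x * sigmoid(x); smooth, nonzero gradient for x < 0
    return x / (1.0 + math.exp(-x))

def relu(x: float) -> float:
    # ReLU: zero output (and zero gradient) for negative inputs
    return max(x, 0.0)

def stochastic_activation(xs, p_relu=0.5, training=True, rng=random):
    """Bernoulli switch between ReLU and SiLU for one feed-forward call.

    During training, use ReLU with probability p_relu, otherwise SiLU.
    At inference (training=False) the activation is fixed to ReLU,
    keeping the hidden vector sparse.
    """
    use_relu = (not training) or rng.random() < p_relu
    act = relu if use_relu else silu
    return [act(x) for x in xs]
```

At inference time every negative coordinate becomes an exact zero, which is what makes the latent vectors sparse.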
Problem

Research questions and friction points this paper is trying to address.

How to avoid the optimization problem caused by ReLU's zero gradient for negative inputs
How to obtain sparse, low-FLOP inference (via ReLU) without paying the training cost of a ReLU-only model
How to increase the diversity of generated text in a controlled way
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stochastic activations randomly select among nonlinear functions
Mixes SiLU and ReLU via a Bernoulli draw during pretraining
Enables training flexibility alongside inference-efficiency optimization
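To illustrate why fixing the activation to ReLU at inference reduces FLOPs: when the hidden vector is sparse, the feed-forward down-projection can skip every weight row whose input coordinate is zero. A minimal sketch under that assumption (function and variable names are hypothetical, not from the paper):

```python
def sparse_down_projection(h, W):
    """Compute h @ W for a ReLU-sparse hidden vector h (length d)
    and weight matrix W (d rows of length m), skipping zero rows.

    Each coordinate zeroed by ReLU removes a full row's worth of
    multiply-adds, which is the FLOP saving the paper exploits on CPU.
    """
    m = len(W[0])
    out = [0.0] * m
    for i, hi in enumerate(h):
        if hi == 0.0:  # zeroed by ReLU: skip this row entirely
            continue
        row = W[i]
        for j in range(m):
            out[j] += hi * row[j]
    return out
```

With, say, 90% of coordinates zeroed, roughly 90% of the down-projection work is skipped; SiLU outputs are almost never exactly zero, so no such shortcut exists for it.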