Language models recognize dropout and Gaussian noise applied to their activations

📅 2026-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates whether large language models can detect, localize, and distinguish between dropout-style masking and Gaussian noise applied to their internal activations. The authors perturb the activations at a target sentence and then ask multiple-choice questions about which sentence was perturbed and which type of perturbation was applied. Models from the Llama, Olmo, and Qwen families (8B to 32B parameters) easily detect and localize the perturbations, often with perfect accuracy, and can learn in context to tell the two perturbation types apart. Notably, Qwen's zero-shot accuracy in identifying the perturbation type increases with perturbation strength and decreases when the in-context labels are flipped, which suggests a prior for the correct labels and raises the possibility of a data-agnostic "training awareness" signal.

📝 Abstract

We provide evidence that language models can detect, localize and, to a certain degree, verbalize the difference between perturbations applied to their activations. More precisely, we either (a) mask activations, simulating dropout, or (b) add Gaussian noise to them, at a target sentence. We then ask a multiple-choice question such as "Which of the previous sentences was perturbed?" or "Which of the two perturbations was applied?". We test models from the Llama, Olmo, and Qwen families, with sizes between 8B and 32B, all of which can easily detect and localize the perturbations, often with perfect accuracy. These models can also learn, when taught in context, to distinguish between dropout and Gaussian noise. Notably, a Qwen model's zero-shot accuracy in identifying which perturbation was applied improves as a function of the perturbation strength and, moreover, decreases if the in-context labels are flipped, suggesting a prior for the correct ones, even modulo controls. Because dropout has been used as a training-regularization technique, while Gaussian noise is sometimes added during inference, we discuss the possibility of a data-agnostic "training awareness" signal and the implications for AI safety. The code and data are available at https://github.com/saifh-github/llm-dropout-noise-recognition and https://drive.google.com/file/d/1es-Sfw_AH9GficeXgeqpy87rocrZZ_PQ/view, respectively.
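To make the two perturbations concrete, here is a minimal, illustrative sketch in plain Python. It is not the authors' implementation: the function name `perturb_activations`, its parameters, the toy activation shapes, and the optional `rescale` flag are all assumptions introduced for illustration. The abstract does not say which layers are perturbed, how strength is parameterized, or whether surviving activations are rescaled after masking.

```python
import random


def perturb_activations(acts, target, kind, strength, seed=0, rescale=False):
    """Return a copy of `acts` with the token vectors at `target` perturbed.

    acts:     list of token vectors (lists of floats), one per token
    target:   iterable of token indices that make up the target sentence
    kind:     "dropout"  -> zero each entry with probability `strength`
              "gaussian" -> add N(0, strength**2) noise to each entry
    rescale:  ASSUMPTION, not from the abstract: if True, surviving entries
              are scaled by 1 / (1 - p), as in standard inverted dropout
    """
    rng = random.Random(seed)
    target = set(target)
    out = []
    for i, vec in enumerate(acts):
        if i not in target:
            # Tokens outside the target sentence are left untouched.
            out.append(list(vec))
        elif kind == "dropout":
            scale = 1.0 / (1.0 - strength) if rescale and strength < 1.0 else 1.0
            out.append([0.0 if rng.random() < strength else x * scale for x in vec])
        elif kind == "gaussian":
            out.append([x + rng.gauss(0.0, strength) for x in vec])
        else:
            raise ValueError(f"unknown perturbation kind: {kind!r}")
    return out


# Toy example: 6 tokens with 8-dimensional activations; tokens 2-3 form
# the hypothetical target sentence.
acts = [[1.0] * 8 for _ in range(6)]
target_sentence = range(2, 4)
dropped = perturb_activations(acts, target_sentence, "dropout", strength=0.5)
noised = perturb_activations(acts, target_sentence, "gaussian", strength=0.1)
```

In an actual experiment, a function like this would presumably be applied to a model's hidden states during the forward pass, after which the model is asked the multiple-choice question. That wiring depends on the model framework and is left out of this sketch.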

Problem

Research questions and friction points this paper is trying to address.

language models

dropout

Gaussian noise

activation perturbations

AI safety

Innovation

Methods, ideas, or system contributions that make the work stand out.

dropout detection

Gaussian noise recognition

training awareness