🤖 AI Summary
This study investigates whether large language models can detect, localize, and distinguish between dropout-style masking and Gaussian noise applied to their internal activations. The perturbation is applied to the activations of a target sentence's tokens, and the model is then asked multiple-choice questions about which sentence was perturbed and which perturbation was used. Models from the Llama, Olmo, and Qwen families (8B to 32B parameters) detect and localize the perturbations easily, often with perfect accuracy, and can learn in context to tell the two perturbation types apart. Notably, Qwen's zero-shot accuracy in identifying the perturbation type increases with perturbation strength, and its accuracy drops when the in-context labels are flipped, which suggests a prior for the correct labels and raises the possibility that the model encodes a signal related to its training mechanisms.
📝 Abstract
We provide evidence that language models can detect, localize and, to a certain degree, verbalize the difference between perturbations applied to their activations. More precisely, we either (a) \emph{mask} activations, simulating \emph{dropout}, or (b) add \emph{Gaussian noise} to them, at a target sentence. We then ask a multiple-choice question such as ``\emph{Which of the previous sentences was perturbed?}'' or ``\emph{Which of the two perturbations was applied?}''.
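To make the two perturbations concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the activations are available as an array of shape `(seq_len, hidden_dim)` and that the target sentence occupies the token range `[start, end)`. The function names, the `rescale` option (standard inverted-dropout scaling), and the parameter values are illustrative assumptions; the released code may differ on all of these points.

```python
import numpy as np

def mask_activations(acts, start, end, p, rng, rescale=True):
    """Zero each activation in tokens [start, end) with probability p (dropout-style).

    Assumption: `rescale=True` applies the usual inverted-dropout factor 1/(1-p);
    whether the paper rescales is not stated in the abstract.
    """
    out = acts.copy()
    keep = rng.random(out[start:end].shape) >= p  # True where the unit survives
    out[start:end] = out[start:end] * keep
    if rescale and p < 1.0:
        out[start:end] /= (1.0 - p)
    return out

def add_gaussian_noise(acts, start, end, sigma, rng):
    """Add i.i.d. zero-mean Gaussian noise with std `sigma` to tokens [start, end)."""
    out = acts.copy()
    out[start:end] += rng.normal(0.0, sigma, size=out[start:end].shape)
    return out

# Toy example: 12 tokens with hidden size 16; the "target sentence" is tokens 4..7.
rng = np.random.default_rng(0)
acts = rng.normal(size=(12, 16))
dropped = mask_activations(acts, 4, 8, p=0.5, rng=rng)
noised = add_gaussian_noise(acts, 4, 8, sigma=0.1, rng=rng)
```

In both cases only the target span changes, so any signal the model uses to answer the multiple-choice question must come from the perturbed tokens or their downstream effect on later positions.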
We test models from the Llama, Olmo, and Qwen families, with sizes between 8B and 32B, all of which can easily detect and localize the perturbations, often with perfect accuracy. These models can also learn, when taught in context, to distinguish between dropout and Gaussian noise. Notably, \qwenb's \emph{zero-shot} accuracy in identifying which perturbation was applied improves with the perturbation strength and, moreover, decreases if the in-context labels are flipped, suggesting a prior for the correct labels, even after accounting for controls.
Because dropout has been used as a training-regularization technique, while Gaussian noise is sometimes added during inference, we discuss the possibility of a data-agnostic ``training awareness'' signal and the implications for AI safety.
The code and data are available at \href{https://github.com/saifh-github/llm-dropout-noise-recognition}{link 1} and \href{https://drive.google.com/file/d/1es-Sfw_AH9GficeXgeqpy87rocrZZ_PQ/view}{link 2}, respectively.