Steer-to-Detect: Probing Hidden Representations for Detection of LLM-Generated Texts

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Detectors that rely on a language model's internal representations struggle to separate large language model (LLM)-generated text from human-written text, because the raw hidden representations of the two classes overlap substantially. To address this limitation, this work proposes Steer-to-Detect (S2D), a two-stage detection framework. In the first stage, S2D learns a steering vector and injects it into the hidden states of a frozen observer LLM, which increases inter-class separability. In the second stage, detection is performed via hypothesis testing on the steered representations. Notably, S2D is the first approach to explicitly shape representations through a steering vector, and it comes with finite-sample, high-probability guarantees on both Type I and Type II errors. Experiments show that S2D achieves strong and consistent detection performance in in-distribution, out-of-distribution, and adversarially perturbed scenarios.
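The steering stage can be pictured with a toy model. The sketch below is illustrative only and is not the paper's procedure: the two-layer `tanh` network standing in for the frozen observer LLM, the synthetic inputs, the Fisher-style `separability` score, and the random-search update are all assumptions made for this example. It shows only the mechanics: the frozen weights never change, and a single additive vector `v` in the hidden state is adjusted so that downstream representations of the two classes move apart.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a frozen observer LLM: two fixed random layers.
D_IN, D_HID, D_OUT = 8, 16, 4
W1 = rng.normal(size=(D_IN, D_HID)) / np.sqrt(D_IN)
W2 = rng.normal(size=(D_HID, D_OUT)) / np.sqrt(D_HID)


def steered_representation(x, v):
    """Run the frozen toy model, adding steering vector v to the hidden state."""
    hidden = np.tanh(x @ W1)      # frozen lower layer
    hidden = hidden + v           # steering injection (the only adjustable part)
    return np.tanh(hidden @ W2)   # frozen upper layer sees the shifted state


def separability(x_human, x_llm, v):
    """Fisher-style ratio: squared gap between class means over total variance."""
    r_h = steered_representation(x_human, v)
    r_l = steered_representation(x_llm, v)
    gap = np.sum((r_h.mean(axis=0) - r_l.mean(axis=0)) ** 2)
    spread = r_h.var(axis=0).sum() + r_l.var(axis=0).sum()
    return gap / (spread + 1e-12)


# Synthetic "human" and "LLM" inputs drawn from overlapping distributions.
x_human = rng.normal(loc=0.0, size=(200, D_IN))
x_llm = rng.normal(loc=0.3, size=(200, D_IN))

# Random-search hill climbing over v (an assumed, deliberately simple optimizer).
v = np.zeros(D_HID)
baseline = separability(x_human, x_llm, v)
best = baseline
for _ in range(500):
    candidate = v + 0.1 * rng.normal(size=D_HID)
    score = separability(x_human, x_llm, candidate)
    if score > best:              # keep a candidate only if it improves the score
        v, best = candidate, score

print(f"separability without steering: {baseline:.4f}")
print(f"separability with steering:    {best:.4f}")
```

Because the vector is injected before a nonlinearity, it changes how the upper layer responds to each class rather than shifting both classes by the same amount. A real setup would more likely use gradient-based training and a hook on an actual transformer layer.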

📝 Abstract

The rapid advancement of large language models (LLMs) has made machine-generated text increasingly difficult to distinguish from human-written text. While recent studies explore leveraging internal representations of language models to uncover deeper detection signals, these raw features often exhibit substantial overlap between classes, limiting their discriminative power. To address this challenge, we propose Steer-to-Detect (S2D), a two-stage framework for detecting LLM-generated text. In the first stage, S2D learns a steering vector that is injected into the hidden states of a frozen observer LLM, producing representations with improved class separability. In the second stage, detection is performed via a hypothesis testing procedure based on the steered representations. We establish finite-sample, high-probability guarantees for Type I and Type II errors, providing a theoretical characterization of the procedure. Empirically, S2D achieves strong and consistent performance across a range of settings, including out-of-distribution scenarios and adversarial perturbations.
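The abstract does not spell out the test statistic or the decision rule, so the following is only a generic sketch of what a detection stage framed as a hypothesis test can look like. It assumes each text has already been reduced to a scalar score (for instance, a projection of its steered representation), takes "the text is human-written" as the null hypothesis, and calibrates the rejection threshold on held-out human scores with a standard split-conformal-style order statistic. The function names and the synthetic Gaussian scores are hypothetical, and the marginal Type I control shown here is a stand-in for, not a reproduction of, the paper's high-probability guarantees.

```python
import numpy as np


def calibrate_threshold(human_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest human calibration score.

    If calibration and test human scores are exchangeable, a human score
    exceeds this threshold with probability at most alpha.
    """
    scores = np.sort(np.asarray(human_scores, dtype=float))
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))  # rank of the order statistic
    if k > n:
        return np.inf  # too few calibration samples to certify level alpha
    return scores[k - 1]


def detect(score, threshold):
    """Reject H0 ("human-written") when the score exceeds the threshold."""
    return score > threshold


rng = np.random.default_rng(0)

# Synthetic scalar scores; assumed to come from steered representations.
human_cal = rng.normal(0.0, 1.0, size=1000)   # calibration set, human only
human_test = rng.normal(0.0, 1.0, size=2000)  # unseen human-written texts
llm_test = rng.normal(3.0, 1.0, size=2000)    # unseen LLM-generated texts

alpha = 0.05
tau = calibrate_threshold(human_cal, alpha)

type_i = np.mean(detect(human_test, tau))   # human texts wrongly flagged
type_ii = np.mean(~detect(llm_test, tau))   # LLM texts that slip through

print(f"threshold: {tau:.3f}")
print(f"empirical Type I error:  {type_i:.3f}")
print(f"empirical Type II error: {type_ii:.3f}")
```

In this framing, the Type II error depends directly on how far apart the two score distributions are, which is why improving class separability before testing matters.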

Problem

Research questions and friction points this paper is trying to address.

LLM-generated text detection

hidden representations

class separability

machine-generated text

Innovation

Methods, ideas, or system contributions that make the work stand out.

steering vector

hidden representations

hypothesis testing

LLM-generated text detection