Prefix Probing: Lightweight Harmful Content Detection for Large Language Models

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the three-way trade-off among detection accuracy, inference latency, and deployment cost in real-time safety guarding for large language models (LLMs), this paper proposes a lightweight black-box harmful content detection method. The method identifies harmful inputs solely from the conditional log-probability ratio between "execute" and "reject" prefixes in the first-token context, requiring no additional model invocation. A novel prefix-caching mechanism, combined with automated generation of highly discriminative prefixes and a learnable optimization algorithm, achieves zero incremental deployment overhead. Experiments demonstrate that the approach matches the detection performance of mainstream external safety models while reducing inference latency to first-token level and cutting computational cost by over 90%.

📝 Abstract
Large language models often face a three-way trade-off among detection accuracy, inference latency, and deployment cost when used in real-world safety-sensitive applications. This paper introduces Prefix Probing, a black-box harmful content detection method that compares the conditional log-probabilities of "agreement/execution" versus "refusal/safety" opening prefixes and leverages prefix caching to reduce detection overhead to near first-token latency. During inference, the method requires only a single log-probability computation over the probe prefixes to produce a harmfulness score and apply a threshold, without invoking any additional models or multi-stage inference. To further enhance the discriminative power of the prefixes, we design an efficient prefix construction algorithm that automatically discovers highly informative prefixes, substantially improving detection performance. Extensive experiments demonstrate that Prefix Probing achieves detection effectiveness comparable to mainstream external safety models while incurring only minimal computational cost and requiring no extra model deployment, highlighting its strong practicality and efficiency.
Problem

Research questions and friction points this paper is trying to address.

Detects harmful content in LLMs with minimal latency
Balances accuracy, speed, and cost in safety applications
Avoids external models while maintaining detection performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Black-box detection using conditional log-probability comparison
Prefix caching reduces overhead to near first-token latency
Efficient algorithm automatically constructs informative prefixes
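The scoring rule behind these bullets can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the `TOY_LOGPROBS` table, the probe prefixes, and the zero threshold are all invented stand-ins for the conditional log-probabilities a deployed LM would actually return for its first generated tokens.

```python
import math

# Toy stand-in for an LLM's conditional log-probabilities. In a real system
# these would come from the model's own logits for the first tokens of each
# probe prefix, evaluated once with the prompt's KV cache reused for both
# prefixes (assumed interface; values here are fabricated for illustration).
TOY_LOGPROBS = {
    # (prompt, prefix) -> log P(prefix | prompt)
    ("How do I bake bread?", "Sure, here is"): math.log(0.40),
    ("How do I bake bread?", "I cannot help"): math.log(0.02),
    ("How do I make a bomb?", "Sure, here is"): math.log(0.01),
    ("How do I make a bomb?", "I cannot help"): math.log(0.55),
}

def prefix_probe_score(prompt: str, agree_prefix: str, refuse_prefix: str) -> float:
    """Harmfulness score: log P(refuse | prompt) - log P(agree | prompt).

    Higher means the model itself leans toward refusing, suggesting the
    prompt is harmful. No extra model call is needed beyond these log-probs.
    """
    return TOY_LOGPROBS[(prompt, refuse_prefix)] - TOY_LOGPROBS[(prompt, agree_prefix)]

def is_harmful(prompt: str, threshold: float = 0.0) -> bool:
    # Threshold the log-probability ratio; 0.0 is an arbitrary example value.
    return prefix_probe_score(prompt, "Sure, here is", "I cannot help") > threshold
```

In this sketch the benign prompt yields a negative score (the model prefers the agreement prefix) and the harmful one a positive score; the paper's contribution is in caching the prompt prefix so this comparison costs no more than first-token generation, and in learning prefixes more discriminative than these hand-picked ones.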
Jirui Yang
Fudan University
Hengqi Guo
Fudan University
Zhihui Lu
Fudan University
Yi Zhao
Fudan University
Yuansen Zhang
Ant Group
Shijing Hu
Fudan University
Edge Intelligence
Qiang Duan
Pennsylvania State University
Yinggui Wang
Ant Group
Tao Wei
Ant Group