Prefix Probing: Lightweight Harmful Content Detection for Large Language Models

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the three-way trade-off among detection accuracy, inference latency, and deployment cost in real-time safety guarding for large language models (LLMs), this paper proposes a lightweight black-box harmful content detection method. The method identifies harmful inputs solely from the conditional log-probability ratio between "execute" and "reject" prefixes in the first-token context, requiring no additional model invocation. A novel prefix-caching mechanism, combined with automated generation of highly discriminative prefixes and a learnable optimization algorithm, achieves zero incremental deployment overhead. Experiments demonstrate that the approach matches the detection performance of mainstream external safety models while reducing inference latency to first-token level and cutting computational cost by over 90%.

📝 Abstract
Large language models often face a three-way trade-off among detection accuracy, inference latency, and deployment cost when used in real-world safety-sensitive applications. This paper introduces Prefix Probing, a black-box harmful content detection method that compares the conditional log-probabilities of "agreement/execution" versus "refusal/safety" opening prefixes and leverages prefix caching to reduce detection overhead to near first-token latency. During inference, the method requires only a single log-probability computation over the probe prefixes to produce a harmfulness score and apply a threshold, without invoking any additional models or multi-stage inference. To further enhance the discriminative power of the prefixes, we design an efficient prefix construction algorithm that automatically discovers highly informative prefixes, substantially improving detection performance. Extensive experiments demonstrate that Prefix Probing achieves detection effectiveness comparable to mainstream external safety models while incurring only minimal computational cost and requiring no extra model deployment, highlighting its strong practicality and efficiency.
Problem

Research questions and friction points this paper is trying to address.

Detects harmful content in LLMs with minimal latency
Balances accuracy, speed, and cost in safety applications
Avoids external models while maintaining detection performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Black-box detection using conditional log-probability comparison
Prefix caching reduces overhead to near first-token latency
Efficient algorithm automatically constructs informative prefixes
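The scoring rule behind these bullets can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the `TOY_LOGPROBS` table, the probe prefixes, and the zero threshold are all invented stand-ins for the conditional log-probabilities a deployed LM would actually return for its first generated tokens.

```python
import math

# Toy stand-in for an LLM's conditional log-probabilities. In a real system
# these would come from the model's own logits for the first tokens of each
# probe prefix, evaluated once with the prompt's KV cache reused for both
# prefixes (assumed interface; values here are fabricated for illustration).
TOY_LOGPROBS = {
    # (prompt, prefix) -> log P(prefix | prompt)
    ("How do I bake bread?", "Sure, here is"): math.log(0.40),
    ("How do I bake bread?", "I cannot help"): math.log(0.02),
    ("How do I make a bomb?", "Sure, here is"): math.log(0.01),
    ("How do I make a bomb?", "I cannot help"): math.log(0.55),
}

def prefix_probe_score(prompt: str, agree_prefix: str, refuse_prefix: str) -> float:
    """Harmfulness score: log P(refuse | prompt) - log P(agree | prompt).

    Higher means the model itself leans toward refusing, suggesting the
    prompt is harmful. No extra model call is needed beyond these log-probs.
    """
    return TOY_LOGPROBS[(prompt, refuse_prefix)] - TOY_LOGPROBS[(prompt, agree_prefix)]

def is_harmful(prompt: str, threshold: float = 0.0) -> bool:
    # Threshold the log-probability ratio; 0.0 is an arbitrary example value.
    return prefix_probe_score(prompt, "Sure, here is", "I cannot help") > threshold
```

In this sketch the benign prompt yields a negative score (the model prefers the agreement prefix) and the harmful one a positive score; the paper's contribution is in caching the prompt prefix so this comparison costs no more than first-token generation, and in learning prefixes more discriminative than these hand-picked ones.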
Jirui Yang
Fudan University
Hengqi Guo
Fudan University
Zhihui Lu
Fudan University
Yi Zhao
Fudan University
Yuansen Zhang
Ant Group
Shijing Hu
Fudan University
Edge Intelligence
Qiang Duan
Pennsylvania State University
Yinggui Wang
Ant Group
Tao Wei
Ant Group