🤖 AI Summary
Traditional LLM safety monitors incur a fixed computational overhead per query, struggling to balance efficiency on benign inputs with high detection rates for sophisticated threats. To address this, we propose a dynamic safety monitoring mechanism based on Truncated Polynomial Classifiers (TPCs), which extend linear probes to enable layer-wise, early-exit progressive probing for fine-grained detection of harmful requests in language model activations. The method combines progressive term-by-term training and inference, dynamic truncation, and an adaptive cascaded architecture, supporting on-demand switching between a lightweight detection mode and a stronger protection mode. Evaluated on WildGuardMix and BeaverTails across four models of up to 30B parameters, TPCs match or surpass comparably sized MLP probes in detection performance while improving computational efficiency and decision interpretability.
📝 Abstract
Monitoring large language model (LLM) activations is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible: costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term-by-term. At test time, one can early-stop for lightweight monitoring, or use more terms for stronger guardrails when needed. TPCs provide two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can "buy" stronger guardrails from the same model. Second, as an adaptive cascade: clear cases exit early after low-order checks, and higher-order guardrails are evaluated only for ambiguous inputs, reducing overall monitoring costs. On two large-scale safety datasets (WildGuardMix and BeaverTails), for 4 models with up to 30B parameters, we show that TPCs compete with or outperform MLP-based probe baselines of the same size, all the while being more interpretable than their black-box counterparts. Our code is available at http://github.com/james-oldfield/tpc.
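The term-by-term evaluation and early-exit cascade described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the class name, the toy random weights, and the `margin` early-exit threshold are all assumptions introduced here for illustration.

```python
# Hypothetical sketch of a truncated polynomial classifier (TPC) with
# early exit. Names, weights, and thresholds are illustrative assumptions.
import numpy as np

class TruncatedPolynomialClassifier:
    """Scores an activation vector x with a polynomial evaluated term-by-term:

        logit(x) = b + w_1 . x + w_2 . x^2 + ... + w_K . x^K

    where x^k is the elementwise k-th power. Because the logit is a sum of
    per-order terms, it can be trained and evaluated progressively.
    """

    def __init__(self, dim, order, seed=0):
        rng = np.random.default_rng(seed)
        self.bias = 0.0
        # One weight vector per polynomial order (toy random weights here;
        # in practice these would be learned on labeled activations).
        self.weights = [rng.normal(scale=0.01, size=dim) for _ in range(order)]

    def partial_logits(self, x):
        """Yield the running logit after each term: the early-exit sequence."""
        logit = self.bias
        for k, w_k in enumerate(self.weights, start=1):
            logit += w_k @ (x ** k)  # k-th order term
            yield k, logit

    def classify(self, x, margin=2.0):
        """Adaptive cascade: stop once the logit is confidently far from zero."""
        for k, logit in self.partial_logits(x):
            if abs(logit) > margin:   # confident -> exit early, saving compute
                return logit > 0, k
        return logit > 0, k           # ambiguous input used the full polynomial

# Usage: score a (fake) activation vector and see how many terms were needed.
tpc = TruncatedPolynomialClassifier(dim=16, order=4)
x = np.ones(16)
is_harmful, terms_used = tpc.classify(x)
print(is_harmful, terms_used)
```

The "safety dial" mode corresponds to fixing the number of terms evaluated (truncating `partial_logits` at a chosen order), while the cascade mode lets each input choose its own exit point via the confidence margin.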