Preemptive Detection and Steering of LLM Misalignment via Latent Reachability

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) remain prone to generating harmful content during inference, posing persistent safety risks. Method: This paper proposes BRT-Align, the first framework to integrate backward reachability analysis—a control-theoretic technique—into LLM latent-space modeling. It formulates text generation as a dynamic system, enabling proactive prediction and blocking of unsafe generation trajectories. The method jointly incorporates latent-space autoregressive modeling, backward reachability learning, safety-value function training, and runtime latent-variable filtering to achieve early detection and minimally intrusive, safety-guided generation. Results: Extensive experiments across multiple LLMs and toxicity benchmarks demonstrate that BRT-Align detects unsafe tendencies significantly earlier than state-of-the-art methods, substantially reduces toxic output, and preserves textual diversity, coherence, and semantic fidelity—effectively mitigating aggression, vulgarity, and bias without compromising generation quality.

📝 Abstract
Large language models (LLMs) are now ubiquitous in everyday tools, raising urgent safety concerns about their tendency to generate harmful content. The dominant safety approach -- reinforcement learning from human feedback (RLHF) -- effectively shapes model behavior during training but offers no safeguards at inference time, where unsafe continuations may still arise. We propose BRT-Align, a reachability-based framework that brings control-theoretic safety tools to LLM inference. BRT-Align models autoregressive generation as a dynamical system in latent space and learns a safety value function via backward reachability, estimating the worst-case evolution of a trajectory. This enables two complementary mechanisms: (1) a runtime monitor that forecasts unsafe completions several tokens in advance, and (2) a least-restrictive steering filter that minimally perturbs latent states to redirect generation away from unsafe regions. Experiments across multiple LLMs and toxicity benchmarks demonstrate that BRT-Align provides more accurate and earlier detection of unsafe continuations than baselines. Moreover, for LLM safety alignment, BRT-Align substantially reduces unsafe generations while preserving sentence diversity and coherence. Qualitative results further highlight emergent alignment properties: BRT-Align consistently produces responses that are less violent, less profane, less offensive, and less politically biased. Together, these findings demonstrate that reachability analysis provides a principled and practical foundation for inference-time LLM safety.
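The two mechanisms described in the abstract (a runtime monitor and a least-restrictive steering filter) can be sketched in a few lines, assuming access to a learned safety value function V(z) over latent states z, where V(z) < 0 signals that some reachable continuation is unsafe. All function names, the toy quadratic value function, and the gradient-ascent steering below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def safety_value(z: np.ndarray) -> float:
    """Stand-in for the learned safety value function; here a toy
    quadratic that treats the unit ball in latent space as safe."""
    return 1.0 - float(np.dot(z, z))

def monitor(z: np.ndarray, threshold: float = 0.0) -> bool:
    """Runtime monitor: flag a latent state whose worst-case
    continuation is predicted to become unsafe."""
    return safety_value(z) < threshold

def steer(z: np.ndarray, step: float = 0.05, max_iters: int = 100) -> np.ndarray:
    """Least-restrictive filter: leave safe states untouched; otherwise
    nudge the latent state up the value-function gradient until the
    monitor clears (or the iteration budget runs out)."""
    z = z.copy()
    for _ in range(max_iters):
        if not monitor(z):
            return z  # already safe: no perturbation (least restrictive)
        # finite-difference gradient of V as a toy surrogate for autodiff
        grad = np.array([
            (safety_value(z + step * e) - safety_value(z - step * e)) / (2 * step)
            for e in np.eye(len(z))
        ])
        z = z + step * grad
    return z
```

A safe latent state passes through unchanged, while an unsafe one is minimally perturbed back toward the safe region, which is the "least-restrictive" property the abstract emphasizes.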
Problem

Research questions and friction points this paper is trying to address.

Detecting unsafe LLM outputs before generation using reachability analysis
Steering language models away from harmful content during inference
Providing real-time safety monitoring without retraining the model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Models autoregressive generation as a dynamical system in latent space
Learns a safety value function via backward reachability
Uses a runtime monitor and a least-restrictive steering filter for inference-time safety
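The backward-reachability idea behind the learned safety value function can be illustrated with a toy 1D latent space: iterate the backup V(z) ← min(l(z), V(f(z))) until the negative region of V covers every state whose trajectory eventually enters the unsafe set. The margin function l, the drift dynamics f, and the grid discretization below are all illustrative assumptions; the paper learns the value function over an LLM's latent states rather than solving it on a grid.

```python
import numpy as np

def l(z: np.ndarray) -> np.ndarray:
    """Safety margin: negative inside the unsafe region [0.8, 1.0]."""
    return np.where((z >= 0.8) & (z <= 1.0), -1.0, 1.0)

def f(z: np.ndarray) -> np.ndarray:
    """Toy autonomous latent dynamics: states drift away from zero,
    so positive states eventually enter the unsafe region."""
    return np.clip(z + 0.1 * np.sign(z), -1.0, 1.0)

# Discretize the latent space and iterate the reachability backup
# V(z) <- min(l(z), V(f(z))).  At the fixed point, V(z) < 0 exactly on
# states from which the unsafe set is eventually reached.
grid = np.linspace(-1.0, 1.0, 201)
V = l(grid)
for _ in range(100):
    V_next = np.interp(f(grid), grid, V)  # V evaluated at successor states
    V = np.minimum(l(grid), V_next)
```

After convergence, V is negative on all positive latent states (they drift into the unsafe region) and positive on negative ones, which is the early-warning signal the runtime monitor thresholds: unsafety is flagged many steps before the trajectory actually reaches the unsafe set.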