Learning Safety Constraints for Large Language Models

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) face a trade-off between safety (e.g., mitigating harmful outputs and adversarial attacks) and task performance. Method: This paper introduces the Safety Polytope (SaP), a geometric approach that explicitly encodes multiple safety constraints in the LLM's representation space. SaP delimits safe and unsafe regions with interpretable polyhedral facets, enabling post-hoc safety enforcement that leaves model weights unchanged. During training, the facets spontaneously specialize across distinct safety notions (e.g., ethics, factual consistency). Contribution/Results: SaP improves detection of ethically violating inputs (+23.6%), reduces adversarial attack success rates (−41.2%), and preserves original task performance, while grounding safety decisions in human-interpretable geometry.

📝 Abstract
Large language models (LLMs) have emerged as powerful tools but pose significant safety risks through harmful outputs and vulnerability to adversarial attacks. We propose SaP, short for Safety Polytope, a geometric approach to LLM safety that learns and enforces multiple safety constraints directly in the model's representation space. We develop a framework that identifies safe and unsafe regions via the polytope's facets, enabling both detection and correction of unsafe outputs through geometric steering. Unlike existing approaches that modify model weights, SaP operates post-hoc in the representation space, preserving model capabilities while enforcing safety constraints. Experiments across multiple LLMs demonstrate that our method can effectively detect unethical inputs and reduce adversarial attack success rates while maintaining performance on standard tasks, highlighting the importance of an explicit geometric model for safety. Analysis of the learned polytope facets reveals emergent specialization in detecting different semantic notions of safety, providing interpretable insights into how safety is captured in LLMs' representation space.
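The abstract's facet-based detection idea can be sketched in a few lines: a hidden representation is flagged unsafe as soon as it violates any half-space constraint of the polytope. The facet parameters `W` and `b` below are random stand-ins for illustration, not the paper's learned facets.

```python
import numpy as np

# Random stand-in facets (the paper learns these from data): the safe
# region is the polytope {h : W @ h + b <= 0} in the model's hidden space.
rng = np.random.default_rng(0)
dim, n_facets = 8, 4
W = rng.normal(size=(n_facets, dim))
b = -np.ones(n_facets)  # offsets chosen so the origin lies strictly inside

def facet_margins(h):
    """Signed margin per facet; a positive entry marks a violated facet."""
    return W @ h + b

def is_safe(h):
    """A representation is safe iff every facet constraint is satisfied."""
    return bool(np.all(facet_margins(h) <= 0.0))
```

Inspecting which facets fire on an unsafe representation is what makes the scheme interpretable: each facet can come to specialize in one semantic notion of safety, as the analysis in the paper reports.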
Problem

Research questions and friction points this paper is trying to address.

Preventing harmful outputs from large language models
Reducing vulnerability to adversarial attacks in LLMs
Enforcing safety constraints without modifying model weights
Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometric approach enforces safety constraints
Post-hoc representation space operation
Interpretable polytope facets specialization
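The geometric-steering idea in the bullets above can be made concrete as repeated projection onto violated half-spaces (a POCS-style correction). This is a sketch under the assumption that the safe region is a polytope {h : W @ h + b <= 0}; the facets here are illustrative random stand-ins, not the paper's learned ones.

```python
import numpy as np

# Illustrative facets (stand-ins for learned ones): the safe region is
# the polytope {h : W @ h + b <= 0} in representation space.
rng = np.random.default_rng(0)
dim, n_facets = 8, 4
W = rng.normal(size=(n_facets, dim))
b = -np.ones(n_facets)  # the origin lies strictly inside the polytope

def steer(h, max_iters=1000):
    """Correct an unsafe representation by repeatedly projecting it onto
    the most-violated half-space (projections onto convex sets)."""
    h = np.asarray(h, dtype=float).copy()
    for _ in range(max_iters):
        margins = W @ h + b
        worst = int(np.argmax(margins))
        if margins[worst] <= 0.0:
            break  # already inside the polytope: nothing to correct
        w = W[worst]
        # Exact Euclidean projection onto {x : w @ x + b[worst] <= 0}
        h -= (margins[worst] / (w @ w)) * w
    return h
```

Note that this substitutes a generic convex-projection scheme for whatever steering rule the paper actually uses; the point is only that correction reduces to moving the representation back across the violated facets.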
Authors
Xin Chen, Department of Computer Science, ETH Zürich, Zürich, Switzerland
Yarden As, ETH Zürich (Artificial Intelligence, Reinforcement Learning, Robotics)
Andreas Krause, Department of Computer Science, ETH Zürich, Zürich, Switzerland