🤖 AI Summary
Large language models (LLMs) face a fundamental trade-off between safety (e.g., mitigating harmful outputs and adversarial attacks) and task performance.
Method: This paper introduces Safety Polytope (SaP), a geometric approach that explicitly encodes multidimensional safety constraints within the LLM’s latent space. SaP represents safe and unsafe regions as interpretable polyhedral boundaries, enabling weight-invariant, post-hoc safety calibration. Leveraging geometric deep learning and adversarial robustness optimization, SaP induces spontaneous specialization of polyhedral facets across distinct safety dimensions (e.g., ethics, factual consistency).
Contribution/Results: SaP significantly improves detection of ethically violating inputs (+23.6%), reduces adversarial attack success rates (−41.2%), and preserves original task performance. Crucially, its safety decisions are human-interpretable and geometry-grounded, offering the first framework to unify rigorous safety enforcement with transparent, constraint-aware reasoning in the LLM's latent space.
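The summary above describes safety as a polytope of halfspace facets in the model's representation space: a hidden state is flagged when it violates a facet constraint, and geometric steering projects it back inside. A minimal NumPy sketch of that idea follows; the facet parameters `W`, `b`, the dimensions, and the cyclic-projection `steer` routine are illustrative assumptions, not the paper's learned values or exact algorithm.

```python
import numpy as np

# Hypothetical facet parameters: each row of W and entry of b defines one
# halfspace constraint W @ h <= b in representation space. The paper learns
# these from data; here they are random placeholders for illustration.
rng = np.random.default_rng(0)
d, n_facets = 8, 5
W = rng.normal(size=(n_facets, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # unit facet normals
b = np.ones(n_facets)

def violated_facets(h):
    """Indices of facets that hidden state h violates (W @ h > b)."""
    return np.flatnonzero(W @ h > b)

def steer(h, max_iters=500):
    """Move h back into the polytope by cyclically projecting onto the
    hyperplane of one violated facet at a time (POCS-style)."""
    h = h.copy()
    for _ in range(max_iters):
        idx = violated_facets(h)
        if idx.size == 0:
            break  # h now satisfies every facet constraint
        i = idx[0]
        excess = W[i] @ h - b[i]
        h -= excess * W[i]  # project onto facet i's boundary hyperplane
    return h

h_unsafe = rng.normal(size=d) * 3.0  # stand-in for an unsafe hidden state
h_safe = steer(h_unsafe)
```

Because the polytope is an intersection of convex halfspaces, repeated single-facet projections converge to a feasible point, which is why the loop rather than a single pass is needed: projecting onto one facet can re-violate another.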
📝 Abstract
Large language models (LLMs) have emerged as powerful tools but pose significant safety risks through harmful outputs and vulnerability to adversarial attacks. We propose SaP, short for Safety Polytope, a geometric approach to LLM safety that learns and enforces multiple safety constraints directly in the model's representation space. We develop a framework that identifies safe and unsafe regions via the polytope's facets, enabling both detection and correction of unsafe outputs through geometric steering. Unlike existing approaches that modify model weights, SaP operates post-hoc in the representation space, preserving model capabilities while enforcing safety constraints. Experiments across multiple LLMs demonstrate that our method can effectively detect unethical inputs and reduce adversarial attack success rates while maintaining performance on standard tasks, highlighting the importance of an explicit geometric model of safety. Analysis of the learned polytope facets reveals the emergence of specialization in detecting different semantic notions of safety, providing interpretable insights into how safety is captured in LLMs' representation space.