🤖 AI Summary
Current robustness defenses for large language models rely on large datasets of harmful prompts, yet they generalize poorly to novel attacks and distributional shifts. This work proposes Latent Personality Alignment (LPA), which trains models on abstract personality traits using latent adversarial training and never exposes them to harmful examples. With fewer than 100 trait statements, LPA reaches attack success rates comparable to methods trained on more than 150,000 examples while preserving better utility. It also generalizes better to unseen attack distributions, reducing misclassification rates by 2.6x relative to the baseline across six harm benchmarks.
📝 Abstract
Current adversarial robustness methods for large language models require extensive datasets of harmful prompts (thousands to hundreds of thousands of examples), yet remain vulnerable to novel attack vectors and distributional shifts. We propose Latent Personality Alignment (LPA), a sample-efficient defense that achieves robustness by training models on abstract personality traits rather than specific harmful behaviors. Using fewer than 100 trait statements and latent adversarial training, LPA achieves attack success rates comparable to those of methods trained on 150k+ examples, while maintaining superior utility. Critically, LPA generalizes better to unseen attack distributions, reducing misclassification rates by 2.6x compared to the baseline across six harm benchmarks, without ever seeing harmful examples during training. Our results demonstrate that personality-based alignment offers a principled approach to building robust defenses with minimal cost.
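The abstract names latent adversarial training but does not describe the procedure. As a rough illustration of the general idea only, the toy sketch below perturbs a hidden activation (rather than the input) in the direction that most increases the loss, then updates the parameters against that perturbed activation. It is not the authors' LPA implementation: the two-parameter-layer scalar model, the data, the budget `EPS`, and every function name are hypothetical choices made for the example.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sign(t):
    return (t > 0) - (t < 0)

def forward(params, x, delta=0.0):
    """Scalar model: latent h = tanh(w1*x + b1), logit z = w2*(h + delta) + b2."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    z = w2 * (h + delta) + b2
    return h, z

def loss(z, y):
    """Logistic loss for a label y in {-1, +1}."""
    return math.log1p(math.exp(-y * z))

def latent_attack(params, y, eps):
    """Worst-case latent perturbation with |delta| <= eps.

    The logit is linear in the latent, so the loss is maximised by moving
    the latent by eps in the direction that lowers the margin y*z.
    """
    w2 = params[2]
    return -eps * y * sign(w2)

def adversarial_loss(params, data, eps):
    total = 0.0
    for x, y in data:
        delta = latent_attack(params, y, eps)
        _, z = forward(params, x, delta)
        total += loss(z, y)
    return total / len(data)

def train(params, data, eps, lr=0.1, epochs=200):
    """Full-batch gradient descent on the loss under the latent attack."""
    w1, b1, w2, b2 = params
    n = len(data)
    for _ in range(epochs):
        g_w1 = g_b1 = g_w2 = g_b2 = 0.0
        for x, y in data:
            current = (w1, b1, w2, b2)
            delta = latent_attack(current, y, eps)  # treated as a constant
            h, z = forward(current, x, delta)
            dz = -y * sigmoid(-y * z)               # dLoss/dz
            dpre = dz * w2 * (1.0 - h * h)          # back through tanh
            g_w1 += dpre * x
            g_b1 += dpre
            g_w2 += dz * (h + delta)
            g_b2 += dz
        w1 -= lr * g_w1 / n
        b1 -= lr * g_b1 / n
        w2 -= lr * g_w2 / n
        b2 -= lr * g_b2 / n
    return (w1, b1, w2, b2)

# Hypothetical toy data: negative inputs labelled -1, positive inputs +1.
DATA = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]
EPS = 0.3                      # assumed latent perturbation budget
INIT = (0.5, 0.0, 0.5, 0.0)    # assumed initial (w1, b1, w2, b2)

trained = train(INIT, DATA, EPS)
print("adversarial loss before:", adversarial_loss(INIT, DATA, EPS))
print("adversarial loss after: ", adversarial_loss(trained, DATA, EPS))
```

In a real language model the perturbation would be applied to high-dimensional hidden states and found by an inner optimisation loop rather than a closed-form sign step; the closed form is available here only because the toy head is linear in a scalar latent.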