🤖 AI Summary
This work proposes ARREST, a novel framework that addresses the intertwined challenges of factual inaccuracies and safety risks in large language model (LLM) generation. Rather than treating factuality and safety as separate alignment objectives, ARREST identifies their common origin in representational shifts within the model's latent activation space. By introducing an external regulation network that intervenes without fine-tuning the base model parameters, ARREST unifies factual correction with both soft and hard refusal mechanisms. Leveraging adversarial training and representation alignment techniques, the framework significantly enhances output factuality and safety while preserving generative quality. Notably, ARREST demonstrates superior soft refusal capabilities compared to RLHF-aligned models, exhibiting greater adaptability and robustness across diverse scenarios.
📄 Abstract
Human cognition, driven by complex neurochemical processes, oscillates between imagination and reality and learns to self-correct whenever such subtle drifts lead to hallucinations or unsafe associations. In recent years, LLMs have demonstrated remarkable performance across a wide range of tasks, yet they still lack this human-like cognitive balance between factuality and safety. Drawing on this resemblance, we argue that both factual and safety failures in LLMs arise from a representational misalignment in their latent activation space, rather than being entirely separate alignment issues. We hypothesize that an external network, trained to recognize these fluctuations, can selectively intervene in the model to steer falsehoods toward truthfulness and unsafe outputs toward safe ones, without fine-tuning the model parameters themselves. Building on this hypothesis, we propose ARREST (Adversarial Resilient Regulation Enhancing Safety and Truth), a unified framework that identifies and corrects drifted features, engaging both soft and hard refusals in addition to factual corrections. Our empirical results show that ARREST not only corrects misalignment but, owing to its adversarial training, is also more versatile than RLHF-aligned models at generating soft refusals. We make our codebase available at https://github.com/sharanya-dasgupta001/ARREST.
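The central mechanism described above, an external network that rewrites latent activations while the base model's parameters stay frozen, can be sketched with a PyTorch forward hook. This is a minimal illustration under assumed names (`RegulationNetwork`, `attach_regulator`, a toy linear "layer" standing in for a transformer block); it is not the paper's actual architecture or training procedure.

```python
import torch
import torch.nn as nn

class RegulationNetwork(nn.Module):
    """Hypothetical external regulator: detects drift in a hidden state and
    emits a gated corrective offset. Illustrative only, not ARREST itself."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())
        self.correct = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Gated residual correction: intervene in proportion to detected drift.
        return h + self.gate(h) * self.correct(h)

def attach_regulator(layer: nn.Module, regulator: nn.Module):
    """Register a forward hook so the regulator rewrites the layer's output
    without modifying the base model's own parameters."""
    def hook(module, inputs, output):
        return regulator(output)  # returning a tensor replaces the output
    return layer.register_forward_hook(hook)

# Toy frozen "base model": one linear layer standing in for a transformer block.
torch.manual_seed(0)
base_layer = nn.Linear(16, 16)
for p in base_layer.parameters():
    p.requires_grad_(False)  # base model stays frozen

regulator = RegulationNetwork(16)  # only the external network is trainable
handle = attach_regulator(base_layer, regulator)

x = torch.randn(2, 16)
regulated = base_layer(x)  # hook applies the external correction
handle.remove()
plain = base_layer(x)      # same input, no intervention
```

Because the intervention lives entirely in the hook, removing the handle restores the base model's original behavior, which is the practical appeal of regulating activations instead of fine-tuning weights.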