AI Summary
This work addresses the critical security and ethical risks arising from enterprise employees inadvertently leaking sensitive data or generating policy-violating, unethical content when using large language models. To mitigate these risks, we propose SafeGPT, the first unified dual-sided protection framework that integrates input-side sensitive information detection and sanitization with output-side content moderation and rewriting. SafeGPT further incorporates a human-in-the-loop feedback mechanism to jointly optimize safety and user experience. By combining red-teaming attacks with reinforcement learning from human feedback, the system significantly reduces the likelihood of data leakage and biased outputs while maintaining high user satisfaction.
Abstract
Large Language Models (LLMs) are transforming enterprise workflows but introduce security and ethics challenges when employees inadvertently share confidential data or generate policy-violating content. This paper proposes SafeGPT, a two-sided guardrail system that prevents sensitive data leakage and unethical outputs. SafeGPT integrates input-side detection and redaction, output-side moderation and reframing, and human-in-the-loop feedback. Experiments demonstrate that SafeGPT effectively reduces data leakage risk and biased outputs while maintaining high user satisfaction.
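To make the dual-sided design concrete, the following is a minimal illustrative sketch of such a guardrail pipeline: an input-side redaction stage before the LLM call and an output-side moderation stage after it. This is not the paper's implementation; the regex patterns, blocklist terms, and function names are all illustrative assumptions.

```python
import re

# Input side: hypothetical detectors for sensitive identifiers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def sanitize_input(prompt: str) -> str:
    """Detect and redact sensitive data before the prompt reaches the LLM."""
    prompt = EMAIL.sub("[REDACTED_EMAIL]", prompt)
    return SSN.sub("[REDACTED_SSN]", prompt)

# Output side: illustrative policy blocklist (a real system would use a
# trained moderation classifier rather than keywords).
BLOCKLIST = {"internal-only", "trade secret"}

def moderate_output(text: str) -> str:
    """Withhold or reframe policy-violating model output instead of emitting it."""
    if any(term in text.lower() for term in BLOCKLIST):
        return "[Response withheld: policy violation detected; please rephrase.]"
    return text

def safe_complete(prompt: str, llm) -> str:
    """Dual-sided wrapper: sanitize input, call the model, moderate output."""
    return moderate_output(llm(sanitize_input(prompt)))

# Usage with a stub LLM that simply echoes its prompt:
echo = lambda p: f"Echo: {p}"
print(safe_complete("Email jane.doe@corp.com the Q3 numbers", echo))
```

The human-in-the-loop component described above would sit on top of this loop, e.g. routing withheld responses to a reviewer whose decisions feed back into the detectors.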