AI Summary
This work addresses the critical security and ethical risks arising from enterprise employees inadvertently leaking sensitive data or generating policy-violating, unethical content when using large language models. To mitigate these risks, we propose SafeGPT, the first unified dual-sided protection framework that integrates input-side sensitive information detection and sanitization with output-side content moderation and rewriting. SafeGPT further incorporates a human-in-the-loop feedback mechanism to jointly optimize safety and user experience. By combining red-teaming attacks with reinforcement learning from human feedback, the system significantly reduces the likelihood of data leakage and biased outputs while maintaining high user satisfaction.
Abstract
Large Language Models (LLMs) are transforming enterprise workflows but introduce security and ethics challenges when employees inadvertently share confidential data or generate policy-violating content. This paper proposes SafeGPT, a two-sided guardrail system that prevents sensitive data leakage and unethical outputs. SafeGPT integrates input-side detection and redaction, output-side moderation and reframing, and human-in-the-loop feedback. Experiments demonstrate that SafeGPT effectively reduces data leakage risk and biased outputs while maintaining high user satisfaction.
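To make the dual-sided design concrete, the following is a minimal illustrative sketch of such a guardrail pipeline: an input-side redaction stage before the LLM call and an output-side moderation stage after it. This is not the paper's implementation; the regex patterns, blocklist terms, and function names are all illustrative assumptions.

```python
import re

# Input side: hypothetical detectors for sensitive identifiers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def sanitize_input(prompt: str) -> str:
    """Detect and redact sensitive data before the prompt reaches the LLM."""
    prompt = EMAIL.sub("[REDACTED_EMAIL]", prompt)
    return SSN.sub("[REDACTED_SSN]", prompt)

# Output side: illustrative policy blocklist (a real system would use a
# trained moderation classifier rather than keywords).
BLOCKLIST = {"internal-only", "trade secret"}

def moderate_output(text: str) -> str:
    """Withhold or reframe policy-violating model output instead of emitting it."""
    if any(term in text.lower() for term in BLOCKLIST):
        return "[Response withheld: policy violation detected; please rephrase.]"
    return text

def safe_complete(prompt: str, llm) -> str:
    """Dual-sided wrapper: sanitize input, call the model, moderate output."""
    return moderate_output(llm(sanitize_input(prompt)))

# Usage with a stub LLM that simply echoes its prompt:
echo = lambda p: f"Echo: {p}"
print(safe_complete("Email jane.doe@corp.com the Q3 numbers", echo))
```

The human-in-the-loop component described above would sit on top of this loop, e.g. routing withheld responses to a reviewer whose decisions feed back into the detectors.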