🤖 AI Summary
To address cross-stage risks in large language models (LLMs)—including privacy leakage, hallucination generation, value misalignment, malicious misuse, and jailbreaking attacks—this paper proposes the first unified responsible governance framework spanning four critical stages: data acquisition, alignment fine-tuning, prompt-based inference, and post-hoc auditing. Methodologically, it integrates differential privacy, retrieval-augmented generation (RAG), reinforcement learning from human feedback (RLHF) with proximal policy optimization (PPO), chain-of-thought prompting, self-consistency verification, adversarial testing, and explainable auditing. These techniques jointly enforce privacy preservation, hallucination suppression, value alignment, toxicity mitigation, and jailbreak resistance. Furthermore, the work constructs a structured knowledge graph and a comprehensive technical roadmap, transcending unidimensional risk analysis. The framework delivers systematic theoretical foundations and actionable guidelines for developing and deploying secure, trustworthy, and controllable LLMs.
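As one concrete illustration of the inference-stage techniques the summary names, self-consistency verification reduces to sampling several independent reasoning chains for the same prompt and majority-voting their final answers. The sketch below is a minimal illustration under our own assumptions (the `sample_answer` stub and function names are hypothetical, not from the paper):

```python
from collections import Counter

def self_consistency(sample_answer, prompt, n_samples=5):
    """Sample several independent reasoning chains for one prompt and
    return the majority-vote answer plus the agreement rate."""
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    winner, count = Counter(answers).most_common(1)[0]
    confidence = count / n_samples  # agreement rate as a rough reliability signal
    return winner, confidence

# Stub "model": canned samples standing in for an LLM whose chains mostly agree.
samples = iter(["42", "42", "17", "42", "42"])
answer, conf = self_consistency(lambda p: next(samples), "What is 6 * 7?")
# answer == "42", conf == 0.8
```

A low agreement rate can flag likely hallucinations for further checking, which is why self-consistency pairs naturally with the hallucination-suppression goal above.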
📝 Abstract
While large language models (LLMs) hold significant potential for supporting numerous real-world applications and delivering positive social impact, they still face inherent risks of privacy leakage, hallucinated outputs, and value misalignment, and can be maliciously exploited to generate toxic content or serve unethical purposes after being jailbroken. Therefore, in this survey, we present a comprehensive review of recent advances aimed at mitigating these issues, organized across the four phases of LLM development and usage: data collection and pre-training, fine-tuning and alignment, prompting and reasoning, and post-processing and auditing. We elaborate on recent advances that enhance LLMs in terms of privacy protection, hallucination reduction, value alignment, toxicity elimination, and jailbreak defense. In contrast to previous surveys that focus on a single dimension of responsible LLMs, this survey presents a unified framework encompassing these diverse dimensions, providing a comprehensive view of how to enhance LLMs to better serve real-world applications.