Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation

📅 2025-08-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) pose dual societal risks—unintended or malicious generation of toxic, biased, and offensive content—constituting an urgent socio-technical challenge. To address this, we conduct a systematic literature review and technical analysis to propose the first unified taxonomy for LLM safety, covering both textual and multimodal scenarios; it integrates harm categories (e.g., jailbreaking, multimodal misuse) and defense mechanisms (e.g., RLHF, prompt engineering, LLM-augmented detection). Our analysis identifies critical limitations in current evaluation methodologies—particularly regarding dynamism, cross-modal generalizability, and human-AI collaborative governance. We therefore advocate a forward-looking research agenda centered on dynamic defense strategies and human-in-the-loop governance. This work delivers a systematic theoretical framework and an extensible research paradigm for advancing LLM safety alignment.

Technology Category

Application Category

📝 Abstract
Large Language Models (LLMs) have revolutionized content creation across digital platforms, offering unprecedented capabilities in natural language generation and understanding. These models enable beneficial applications such as content generation, question and answering (Q&A), programming, and code reasoning. Meanwhile, they also pose serious risks by inadvertently or intentionally producing toxic, offensive, or biased content. This dual role of LLMs, both as powerful tools for solving real-world problems and as potential sources of harmful language, presents a pressing sociotechnical challenge. In this survey, we systematically review recent studies spanning unintentional toxicity, adversarial jailbreaking attacks, and content moderation techniques. We propose a unified taxonomy of LLM-related harms and defenses, analyze emerging multimodal and LLM-assisted jailbreak strategies, and assess mitigation efforts, including reinforcement learning with human feedback (RLHF), prompt engineering, and safety alignment. Our synthesis highlights the evolving landscape of LLM safety, identifies limitations in current evaluation methodologies, and outlines future research directions to guide the development of robust and ethically aligned language technologies.
Problem

Research questions and friction points this paper is trying to address.

LLMs generate harmful toxic offensive biased content
Need defenses against adversarial jailbreaking attacks
Improve safety via RLHF prompt engineering alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic review of LLM-related harms and defenses
Analyze multimodal and LLM-assisted jailbreak strategies
Assess mitigation efforts like RLHF and prompt engineering
🔎 Similar Papers
No similar papers found.