Risk-Averse Finetuning of Large Language Models

📅 2025-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) occasionally generate rare but highly harmful toxic outputs, a tail-risk phenomenon that conventional expectation-based training objectives fail to mitigate adequately. Method: the paper proposes a robust fine-tuning paradigm grounded in Conditional Value-at-Risk (CVaR), the first application of CVaR to LLM alignment training. By explicitly modeling and controlling tail risk, it overcomes a limitation of methods such as RLHF, which optimize only average-case performance. The approach integrates risk-aware reinforcement learning, toxicity detection, and preference modeling to jointly optimize toxicity mitigation and generation quality. Results: on sentiment-rewriting and toxicity-mitigation benchmarks, the method reduces the rate of toxic outputs by up to 47% without compromising task performance or text quality, thereby substantially improving LLM safety and robustness.
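To make the contrast concrete, here is a minimal sketch of the two objectives (the notation, the reward r(x, y), and the tail level α are illustrative assumptions, not taken verbatim from the paper): standard RLHF maximizes the expected reward of sampled completions, whereas the risk-averse variant maximizes CVaR, the mean reward over the worst α-fraction of generations.

```latex
% Sketch under assumed notation: x is a prompt, y ~ \pi_\theta(\cdot \mid x) a
% sampled completion, r(x, y) a scalar reward, and \alpha \in (0, 1] the tail level.
\max_{\theta}\; \mathbb{E}_{x,\, y \sim \pi_\theta}\bigl[ r(x, y) \bigr]
\quad \text{(standard RLHF)}
\qquad \text{vs.} \qquad
\max_{\theta}\; \mathrm{CVaR}_{\alpha}\bigl[ r(x, y) \bigr]
\quad \text{(risk-averse)},

% Rockafellar--Uryasev form of the lower-tail CVaR of a reward R; for a
% continuous reward distribution it equals the mean of the worst \alpha-tail.
\mathrm{CVaR}_{\alpha}[R]
  \;=\; \sup_{\nu \in \mathbb{R}} \Bigl( \nu - \tfrac{1}{\alpha}\, \mathbb{E}\bigl[ (\nu - R)_{+} \bigr] \Bigr)
  \;=\; \mathbb{E}\bigl[ R \mid R \le \mathrm{VaR}_{\alpha}(R) \bigr].
```

Letting the tail level α go to 1 recovers the ordinary expectation, so the risk-averse objective strictly generalizes the average-case one.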

📝 Abstract
We consider the challenge of mitigating the generation of negative or toxic content by Large Language Models (LLMs) in response to certain prompts. We propose integrating risk-averse principles into LLM fine-tuning to minimize the occurrence of harmful outputs, particularly rare but significant events. By optimizing the risk measure of Conditional Value at Risk (CVaR), our methodology trains LLMs to exhibit superior performance in avoiding toxic outputs while maintaining effectiveness in generative tasks. Empirical evaluations on sentiment modification and toxicity mitigation tasks demonstrate the efficacy of risk-averse reinforcement learning with human feedback (RLHF) in promoting a safer and more constructive online discourse environment.
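As a rough illustration of how a CVaR objective can be plugged into policy-gradient fine-tuning, the sketch below performs one update restricted to the worst α-fraction of a batch of already-scored completions. The tail level ALPHA, the mean baseline, and all function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): one CVaR-style policy-gradient step.
import torch

ALPHA = 0.2  # assumed fraction of worst-reward samples the update focuses on


def cvar_policy_gradient_loss(logprobs: torch.Tensor,
                              rewards: torch.Tensor,
                              alpha: float = ALPHA) -> torch.Tensor:
    """REINFORCE-style loss computed only on the worst alpha-fraction of rewards.

    logprobs: (batch,) summed log-probabilities of each sampled completion.
    rewards:  (batch,) scalar rewards, e.g. from a toxicity/preference model
              (higher = safer / better).
    """
    k = max(1, int(alpha * rewards.numel()))
    # The alpha-tail: indices of the k lowest-reward completions in the batch.
    tail_rewards, tail_idx = torch.topk(rewards, k, largest=False)
    # A mean baseline inside the tail reduces gradient variance
    # (a common choice, not necessarily the paper's).
    advantages = tail_rewards - tail_rewards.mean()
    # Maximizing the expected tail reward == minimizing this negative
    # advantage-weighted log-probability.
    return -(advantages.detach() * logprobs[tail_idx]).mean()


if __name__ == "__main__":
    # Toy usage with random tensors standing in for an LM and a reward model.
    torch.manual_seed(0)
    logprobs = torch.randn(16, requires_grad=True)
    rewards = torch.randn(16)
    loss = cvar_policy_gradient_loss(logprobs, rewards)
    loss.backward()
    print(f"CVaR policy-gradient loss: {loss.item():.4f}")
```

In a full RLHF loop this loss would stand in for the usual expectation-based policy-gradient term, with the KL penalty to the reference model and the sampling machinery kept unchanged.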
Problem

Research questions and friction points this paper is trying to address.

Large Language Model
Harmful Content Prevention
Creativity Preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimized Conditional Value at Risk
Risk-Averse Reinforcement Learning
Harmful Language Mitigation