Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement

📅 2024-07-01
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
LLMs often generate false, harmful, or uninformative responses when exposed to short, ambiguous, or adversarial prompts. To address this, we propose a lightweight, plug-and-play query refinement framework that optimizes user inputs in real time—before they are fed to the LLM—using a compact Transformer model trained via multi-objective reinforcement learning (PPO). The model jointly maximizes three reward signals: truthfulness, harmlessness, and helpfulness. We introduce a novel “capability enhancement–robust defense” co-optimization paradigm, enabling seamless integration and cross-task transferability. Experiments demonstrate substantial improvements in factual consistency and safety on benchmarks including TruthfulQA and ToxiGen. Moreover, our method achieves over 92% defense success rate against prominent jailbreaking attacks such as GCG and AutoDAN, confirming its effectiveness in mitigating adversarial prompt manipulation while preserving utility.
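The multi-objective PPO training described above scalarizes several reward signals into one optimization target. A minimal sketch of that idea, assuming a simple weighted-sum combination (the function name, weights, and scoring scale are illustrative assumptions, not details from the paper):

```python
# Hypothetical sketch: combining the three per-response reward signals
# (truthfulness, harmlessness, helpfulness) into a single scalar reward
# for PPO. The weighted-sum form and the weights are assumptions.

def combined_reward(truthfulness: float,
                    harmlessness: float,
                    helpfulness: float,
                    weights: tuple = (1.0, 1.0, 1.0)) -> float:
    """Scalarize three reward-model scores into one PPO reward."""
    w_t, w_s, w_h = weights
    return w_t * truthfulness + w_s * harmlessness + w_h * helpfulness

# Example: score a refined query whose downstream response rates
# 0.9 / 1.0 / 0.8 on the three axes under equal weights.
reward = combined_reward(0.9, 1.0, 0.8)  # 2.7 with unit weights
```

In practice such rewards are fed to a PPO trainer that updates the lightweight refinement model, so the refiner learns to rewrite queries whose downstream responses score well on all three axes at once.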

📝 Abstract
The capacity of large language models (LLMs) to generate honest, harmless, and helpful responses heavily relies on the quality of user prompts. However, these prompts are often brief and vague, significantly limiting the full potential of LLMs. Moreover, harmful prompts can be meticulously crafted and manipulated by adversaries to jailbreak LLMs, inducing them to produce toxic content. To enhance the capabilities of LLMs while maintaining strong robustness against harmful jailbreak inputs, this study proposes a transferable and pluggable framework that refines user prompts before they are input into LLMs. This strategy improves the quality of the queries, empowering LLMs to generate more truthful, benign, and useful responses. Specifically, a lightweight query refinement model is introduced and trained using a specially designed reinforcement learning approach that incorporates multiple objectives to enhance particular capabilities of LLMs. Extensive experiments demonstrate that the refinement model not only improves the quality of responses but also strengthens their robustness against jailbreak attacks. Code is available at: https://github.com/Huangzisu/query-refinement
Problem

Research questions and friction points this paper is trying to address.

Improving LLM response quality through query refinement
Enhancing robustness against harmful jailbreak inputs
Developing a reinforcement learning-driven refinement framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning-Driven Query Refinement
Lightweight Query Refinement Model
Multiple Objectives Reinforcement Learning
Zisu Huang
School of Computer Science, Fudan University, Shanghai, China; Shanghai Key Laboratory of Intelligent Information Processing
Xiaohua Wang
School of Computer Science, Fudan University, Shanghai, China; Shanghai Key Laboratory of Intelligent Information Processing
Feiran Zhang
Novo Nordisk
RNA Biology · Epigenetics · Genomics · Bioinformatics · Pharmacology
Zhibo Xu
Fudan University
Large Language Models · Agent RL
Cenyuan Zhang
School of Computer Science, Fudan University, Shanghai, China; Shanghai Key Laboratory of Intelligent Information Processing
Xiaoqing Zheng
Fudan University
Natural Language Processing and Machine Learning
Xuanjing Huang
School of Computer Science, Fudan University, Shanghai, China; Shanghai Key Laboratory of Intelligent Information Processing