🤖 AI Summary
This work proposes SafeTune, a search-based multi-objective optimization approach that jointly tunes system prompts and decoding hyperparameters of large language models to reduce harmful outputs while increasing response relevance. In an initial evaluation on Qwen3.5 0.8B, SafeTune significantly reduces the rate of harmful responses and increases prompt-response relevance, both with a large effect size. Among the parameters explored, encouraging greater repetition in responses has the largest impact on reducing harmfulness while increasing relevance.
📝 Abstract
The widespread adoption of Large Language Models (LLMs) raises concerns about the potential harmfulness of their responses. In this paper, we first investigate the harmfulness of responses from four general-purpose LLMs. Next, we propose SafeTune, a multi-objective search-based approach to mitigate harmfulness while increasing response relevance through hyperparameter tuning and system prompt engineering. Our initial evaluation shows that SafeTune significantly reduces the rate of harmful responses generated by Qwen3.5 0.8B and increases prompt-response relevance (both with a large effect size). Among the parameters we explore, we also find that encouraging greater repetition in responses is most impactful in reducing harmfulness while increasing relevance.
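To make the multi-objective framing concrete, the following is a minimal sketch of how candidate configurations could be compared on two objectives, harmfulness (minimized) and relevance (maximized), using Pareto dominance. It is not SafeTune's implementation: the abstract does not specify the search algorithm, the tuned parameters, or the metrics, so the `Candidate` fields, the parameter names, and all scores below are hypothetical placeholder values chosen only to illustrate the selection step.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    """One hypothetical configuration and its (toy) measured objectives."""
    system_prompt: str         # assumed: a candidate system prompt
    temperature: float         # assumed decoding hyperparameter
    repetition_penalty: float  # assumed decoding hyperparameter
    harm_rate: float           # objective 1: lower is better
    relevance: float           # objective 2: higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and better on one."""
    no_worse = a.harm_rate <= b.harm_rate and a.relevance >= b.relevance
    strictly_better = a.harm_rate < b.harm_rate or a.relevance > b.relevance
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Toy scores for illustration only; these are NOT results from the paper.
candidates = [
    Candidate("prompt A", 0.7, 1.0, harm_rate=0.30, relevance=0.60),
    Candidate("prompt B", 0.7, 1.0, harm_rate=0.20, relevance=0.55),
    Candidate("prompt B", 0.3, 0.9, harm_rate=0.10, relevance=0.70),
    Candidate("prompt C", 1.0, 1.2, harm_rate=0.05, relevance=0.50),
]

front = pareto_front(candidates)
for c in front:
    print(c.system_prompt, c.temperature, c.repetition_penalty,
          c.harm_rate, c.relevance)
```

In a search-based setting, a loop of this kind would typically be wrapped in a procedure that proposes new configurations, scores them on a prompt set, and retains the non-dominated ones. Returning a front rather than a single winner leaves the trade-off between harmfulness and relevance to the person deploying the model.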