SafeTune: Search-based Harmfulness Minimisation for Large Language Models

📅 2026-05-08

🤖 AI Summary

This work proposes SafeTune, a search-based multi-objective optimization approach that jointly tunes system prompts and decoding hyperparameters of large language models to reduce harmful outputs while increasing response relevance. Evaluated on Qwen3.5 0.8B, SafeTune significantly reduces the rate of harmful responses and increases prompt-response relevance, both with a large effect size. Among the parameters explored, encouraging greater repetition in responses has the largest impact on reducing harmfulness while increasing relevance.

📝 Abstract

The widespread adoption of Large Language Models (LLMs) raises concerns about the potential harmfulness of their responses. In this paper, we first investigate the harmfulness of responses from four general-purpose LLMs. Next, we propose SafeTune, a multi-objective search-based approach to mitigate harmfulness while increasing response relevance through hyperparameter tuning and system prompt engineering. Our initial evaluation shows that SafeTune significantly reduces the rate of harmful responses generated by Qwen3.5 0.8B and increases prompt-response relevance (both with a large effect size). Among the parameters we explore, we also find that encouraging greater repetition in responses is most impactful in reducing harmfulness while increasing relevance.
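
The multi-objective search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the search algorithm, so plain random search with a Pareto filter stands in for it here. The configuration fields (`temperature`, `top_p`, `repetition_penalty`), the candidate system prompts, and the `toy_evaluate` scoring function are all hypothetical placeholders; a real setup would score each configuration by generating responses and measuring harmfulness and relevance.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One candidate configuration (field names are illustrative assumptions)."""
    system_prompt: str
    temperature: float
    top_p: float
    repetition_penalty: float


# Hypothetical candidate system prompts, not taken from the paper.
SYSTEM_PROMPTS = (
    "You are a helpful assistant.",
    "You are a careful assistant that declines unsafe requests.",
)


def sample_config(rng):
    """Draw one random configuration from an assumed search space."""
    return Config(
        system_prompt=rng.choice(SYSTEM_PROMPTS),
        temperature=round(rng.uniform(0.1, 1.5), 2),
        top_p=round(rng.uniform(0.5, 1.0), 2),
        repetition_penalty=round(rng.uniform(0.8, 1.5), 2),
    )


def toy_evaluate(config):
    """Placeholder objectives: returns (harm_rate, relevance).

    The formula is arbitrary and carries no empirical meaning. Replace it
    with real measurements on model responses.
    """
    harm_rate = min(1.0, config.temperature / 3.0)
    relevance = max(0.0, config.top_p - 0.1 * config.temperature)
    return (harm_rate, relevance)


def dominates(a, b):
    """True if score a is at least as good as b on both objectives and
    strictly better on one (minimise harm_rate, maximise relevance)."""
    return a[0] <= b[0] and a[1] >= b[1] and (a[0] < b[0] or a[1] > b[1])


def pareto_front(scored):
    """Keep the (config, score) pairs that no other pair dominates."""
    return [(c, s) for c, s in scored
            if not any(dominates(t, s) for _, t in scored)]


def search(n_trials=50, seed=0, evaluate=toy_evaluate):
    """Random search over configurations, returning the non-dominated set."""
    rng = random.Random(seed)
    candidates = [sample_config(rng) for _ in range(n_trials)]
    return pareto_front([(c, evaluate(c)) for c in candidates])


if __name__ == "__main__":
    for config, (harm, rel) in search():
        print(f"harm={harm:.2f} relevance={rel:.2f} {config}")
```

Returning the whole non-dominated set, rather than a single best configuration, leaves the trade-off between harmfulness and relevance to whoever deploys the model.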

Problem

Research questions and friction points this paper is trying to address.

harmfulness

large language models

response relevance

safety mitigation

Innovation

Methods, ideas, or system contributions that make the work stand out.

search-based optimization

harmfulness mitigation

prompt engineering

hyperparameter tuning