Steer Model beyond Assistant: Controlling System Prompt Strength via Contrastive Decoding

📅 2026-01-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited ability of large language models to follow system prompts that conflict with their default “helpful assistant” persona. The authors propose system prompt strength, a training-free method that treats prompt adherence as a continuously adjustable variable. Built on contrastive decoding, it contrasts the logits produced under the target and default system prompts and scales their difference by a scalar factor α, isolating and amplifying the behavioral signal specific to the target persona. Across five benchmarks, including IFEval, OffTopicEval, and Prompt-Steering, the method yields gains of up to +8.5 in strict accuracy on IFEval, +45 percentage points in refusal rate on OffTopicEval, and +13% in steerability on Prompt-Steering, all without retraining.

📝 Abstract
Large language models excel at complex instructions yet struggle to deviate from their helpful assistant persona, as post-training instills strong priors that resist conflicting instructions. We introduce system prompt strength, a training-free method that treats prompt adherence as a continuous control. By contrasting logits from target and default system prompts, we isolate and amplify the behavioral signal unique to the target persona by a scalar factor alpha. Across five diverse benchmarks spanning constraint satisfaction, behavioral control, pluralistic alignment, capability modulation, and stylistic control, our method yields substantial improvements: up to +8.5 strict accuracy on IFEval, +45pp refusal rate on OffTopicEval, and +13% steerability on Prompt-Steering. Our approach enables practitioners to modulate system prompt strength, providing dynamic control over model behavior without retraining.
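
The contrast described above can be illustrated with a minimal sketch. The abstract does not spell out the exact combination rule, so the form used here, default + alpha * (target - default), is an assumption chosen so that alpha = 0 reproduces the default prompt, alpha = 1 reproduces the target prompt, and alpha > 1 amplifies the difference. The function names `steer_logits` and `softmax` and the toy logit values are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert a list of logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_logits(target_logits, default_logits, alpha):
    """Combine per-token logits from two system prompts.

    ASSUMED rule (not confirmed by the source):
        steered = default + alpha * (target - default)
    """
    if len(target_logits) != len(default_logits):
        raise ValueError("logit vectors must have the same length")
    return [d + alpha * (t - d) for t, d in zip(target_logits, default_logits)]


# Toy next-token logits over a 3-token vocabulary (made-up numbers).
default_logits = [2.0, 1.0, 0.0]  # conditioned on the default system prompt
target_logits = [1.0, 2.0, 0.0]   # conditioned on the target system prompt

for alpha in (0.0, 1.0, 2.0):
    probs = softmax(steer_logits(target_logits, default_logits, alpha))
    print(alpha, [round(p, 3) for p in probs])
```

In a real decoder, the two logit vectors would come from two forward passes over the same generated prefix, one per system prompt, at every decoding step. Under the assumed rule, raising alpha above 1 shifts probability mass toward tokens that the target prompt favors more strongly than the default prompt does.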

Problem

Research questions and friction points this paper is trying to address.

system prompt strength
persona deviation
instruction conflict
behavioral control
large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

contrastive decoding
system prompt strength
prompt adherence
behavioral control
training-free steering