🤖 AI Summary
This work addresses the limited ability of large language models to follow system prompts that conflict with their default “helpful assistant” role. The authors propose system prompt strength, a training-free method that treats prompt adherence as a continuously adjustable variable. Building on contrastive decoding, the method contrasts the output logits under the target and default system prompts and scales their difference by a scalar factor α, which isolates and amplifies the behavioral signal specific to the target prompt. Across five benchmarks, including IFEval, OffTopicEval, and Prompt-Steering, it reports gains of up to +8.5 in strict accuracy on IFEval, +45 percentage points in refusal rate on OffTopicEval, and +13% in steerability on Prompt-Steering, all without retraining.
📝 Abstract
Large language models excel at complex instructions yet struggle to deviate from their helpful assistant persona, as post-training instills strong priors that resist conflicting instructions. We introduce system prompt strength, a training-free method that treats prompt adherence as a continuous control. By contrasting logits from target and default system prompts, we isolate and amplify the behavioral signal unique to the target persona by a scalar factor alpha. Across five diverse benchmarks spanning constraint satisfaction, behavioral control, pluralistic alignment, capability modulation, and stylistic control, our method yields substantial improvements: up to +8.5 strict accuracy on IFEval, +45pp refusal rate on OffTopicEval, and +13% steerability on Prompt-Steering. Our approach enables practitioners to modulate system prompt strength, providing dynamic control over model behavior without retraining.
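The logit contrast described above can be illustrated with a minimal sketch. The abstract does not give the exact combination rule, so the formula used here, `default + alpha * (target - default)`, is an assumption chosen so that `alpha = 0` reproduces the default prompt and `alpha = 1` reproduces the target prompt. The function names `steer_logits` and `softmax` and the toy logit values are hypothetical and are not taken from the paper.

```python
import math


def steer_logits(target_logits, default_logits, alpha):
    """Combine next-token logits from two system prompts.

    ASSUMED rule (not confirmed by the abstract):
        steered = default + alpha * (target - default)
    alpha = 0 gives the default-prompt logits, alpha = 1 gives the
    target-prompt logits, and alpha > 1 extrapolates past the target.
    """
    if len(target_logits) != len(default_logits):
        raise ValueError("logit vectors must cover the same vocabulary")
    return [d + alpha * (t - d) for t, d in zip(target_logits, default_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 3-token vocabulary; the values are made up for illustration only.
default_logits = [2.0, 1.0, 0.0]  # logits under the default system prompt
target_logits = [1.0, 2.0, 0.0]   # logits under the target system prompt

for alpha in (0.0, 1.0, 2.0):
    probs = softmax(steer_logits(target_logits, default_logits, alpha))
    print(alpha, [round(p, 3) for p in probs])
```

In an actual decoding loop, a rule of this kind would be applied at every generation step, with two forward passes per token (one per system prompt) over the same conversation and generated prefix. Under this assumed rule, raising `alpha` shifts probability mass toward the tokens that the target prompt favors relative to the default prompt.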