🤖 AI Summary
Existing steering methods for large language models face limitations in fine-grained control, consistency, and usability, hindering their effectiveness in tasks such as style rewriting, user adaptation, and toxicity suppression. This work proposes a training-free, inference-time logit intervention: by constructing a z-normalized token log-odds score table from annotated corpora, the method directly modulates the output distribution prior to decoding, enabling fine-grained generation control that is both training-free and task-agnostic. The approach substantially outperforms conventional prompting- or activation-based intervention strategies, achieving up to a 47-percentage-point improvement in accuracy and a 50-fold increase in F1 score on tasks involving complexity, formality, and toxicity control, while maintaining high efficiency and strong consistency.
📝 Abstract
Steering LLMs is essential for specialized applications such as style-sensitive text rewriting, user-adaptive communication, and toxicity mitigation. Current steering methods, such as prompting-based and activation-based approaches, are widely used to guide model behavior. However, activation-based techniques require deep access to internal layers, while prompting-based steering often fails to provide consistent or fine-grained control. To address these limitations, we propose a training-free, inference-time logit intervention for controllable generation. Our approach uses a statistical token score table, derived from z-normalized log-odds over labeled corpora, to shift the decoding distribution. Empirical evaluations across three diverse datasets focusing on writing complexity, formality, and toxicity demonstrate that our method effectively steers output characteristics, confirming its broad applicability and task-agnostic nature. Our results show that statistically grounded logit steering can achieve large, consistent, multi-task control gains: up to +47 percentage points in accuracy and a 50x improvement in F1.
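The core mechanism described above — building a z-normalized log-odds score table from labeled corpora and adding it to the next-token logits before decoding — can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `log_odds_scores` and `steer_logits`, the smoothing constant, and the strength parameter `alpha` are hypothetical choices for the sketch.

```python
import math
from collections import Counter

def log_odds_scores(pos_corpus, neg_corpus, smoothing=1.0):
    """Build a per-token score table: smoothed log-odds of a token
    appearing in the target-attribute corpus vs. the contrast corpus,
    z-normalized across the shared vocabulary."""
    pos_counts = Counter(tok for doc in pos_corpus for tok in doc)
    neg_counts = Counter(tok for doc in neg_corpus for tok in doc)
    vocab = set(pos_counts) | set(neg_counts)
    pos_total = sum(pos_counts.values())
    neg_total = sum(neg_counts.values())
    raw = {}
    for tok in vocab:
        # Additive smoothing avoids log(0) for tokens seen in only one corpus.
        p = (pos_counts[tok] + smoothing) / (pos_total + smoothing * len(vocab))
        q = (neg_counts[tok] + smoothing) / (neg_total + smoothing * len(vocab))
        raw[tok] = math.log(p / q)
    # z-normalize so scores are comparable across tasks and corpora sizes.
    mean = sum(raw.values()) / len(raw)
    std = math.sqrt(sum((v - mean) ** 2 for v in raw.values()) / len(raw)) or 1.0
    return {tok: (v - mean) / std for tok, v in raw.items()}

def steer_logits(logits, scores, alpha=1.0):
    """Inference-time intervention: shift each candidate token's logit
    by alpha * score before sampling/decoding. Training-free."""
    return {tok: lg + alpha * scores.get(tok, 0.0) for tok, lg in logits.items()}
```

In practice the score table would be indexed by tokenizer IDs and added to the model's logit tensor at each decoding step; steering strength trades off attribute control against fluency via `alpha`.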