SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of cost-effective, flexible human alignment for large language models (LLMs) at inference time, without fine-tuning, training, or model-specific modifications, this paper introduces SDA, an open-source, training-free, model-agnostic alignment framework. SDA dynamically redistributes output token probabilities at inference time according to user-defined steering instructions: it employs steering vectors to recalibrate token probabilities and adds an instruction-driven output-space control mechanism for fine-grained, personalized preference specification. Extensive experiments across eight open-source LLMs of diverse architectures and scales show that SDA consistently improves helpfulness (+64.4%), honesty (+30.0%), and harmlessness (+11.5%) on average. The framework exhibits strong generalization, zero-shot plug-and-play deployment, and compatibility with existing alignment methods, functioning either as a standalone solution or as a complementary enhancement.
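The summary describes SDA as recalibrating next-token probabilities with a steering vector at inference time. The sketch below is a minimal, hypothetical illustration of the general steering-vector idea, not the paper's actual algorithm: the function name `steer_logits`, the scaling factor `alpha`, and the use of the unembedding matrix to score tokens against the steering direction are all illustrative assumptions.

```python
import numpy as np

def steer_logits(logits, steering_vec, unembedding, alpha=4.0):
    """Shift next-token logits toward a steering direction (illustrative).

    Adds alpha * (unembedding @ steering_vec) to the raw logits, boosting
    tokens whose output embeddings align with the steering direction,
    then renormalizes into a probability distribution.
    """
    bias = unembedding @ steering_vec   # (vocab,) alignment score per token
    shifted = logits + alpha * bias
    shifted -= shifted.max()            # numerical stability before exp
    probs = np.exp(shifted)
    return probs / probs.sum()

# Toy example: 5-token vocabulary, 3-dimensional hidden space.
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))   # stand-in unembedding matrix (vocab x hidden)
logits = rng.normal(size=5)   # base next-token logits from the model
v = U[2]                      # steer toward token 2's output direction
probs = steer_logits(logits, v, U)
```

In a real setting the steering vector would be derived from the alignment instruction rather than picked from the vocabulary, but the mechanism, a cheap additive bias on logits followed by renormalization, is what makes such approaches training-free and model-agnostic.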

📝 Abstract
With the rapid advancement of large language models (LLMs), their deployment in real-world applications has become increasingly widespread. LLMs are expected to deliver robust performance across diverse tasks, user preferences, and practical scenarios. However, as demands grow, ensuring that LLMs produce responses aligned with human intent remains a foundational challenge. In particular, aligning model behavior effectively and efficiently during inference, without costly retraining or extensive supervision, is both a critical requirement and a non-trivial technical endeavor. To address this challenge, we propose SDA (Steering-Driven Distribution Alignment), a training-free and model-agnostic alignment framework designed for open-source LLMs. SDA dynamically redistributes model output probabilities based on user-defined alignment instructions, enhancing alignment between model behavior and human intent without fine-tuning. The method is lightweight, resource-efficient, and compatible with a wide range of open-source LLMs. It can function independently during inference or be integrated with training-based alignment strategies. Moreover, SDA supports personalized preference alignment, enabling flexible control over model response behavior. Empirical results demonstrate that SDA consistently improves alignment performance across 8 open-source LLMs of varying scales and diverse origins, evaluated on three key alignment dimensions: helpfulness, harmlessness, and honesty (3H). Specifically, SDA achieves average gains of 64.4% in helpfulness, 30.0% in honesty, and 11.5% in harmlessness across the tested models, indicating its effectiveness and generalization across diverse models and application scenarios.
Problem

Research questions and friction points this paper is trying to address.

Aligning LLM responses with human intent without fine-tuning
Enhancing model behavior across helpfulness, harmlessness, and honesty
Enabling personalized preference alignment during inference efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free alignment framework for open-source LLMs
Dynamic probability redistribution using alignment instructions
Model-agnostic method enhancing helpfulness, honesty, harmlessness
Wei Xia
State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University
Zhi-Hong Deng
School of Intelligence Science and Technology, Peking University
deep learning · NLP · data/text mining