🤖 AI Summary
This work addresses the limited controllability of large language models (LLMs) in aligning with diverse user preferences. We propose Controllable Alignment Language Models (CLMs), which insert learnable identity layers before the first transformer layer of a base LLM to learn a mapping from the unaligned input embedding space to a preference-aligned one. During inference, alignment strength is continuously and precisely controlled via a linear interpolation mechanism, enabling, for the first time, smooth interpolation and extrapolation of alignment behavior within a single model. CLMs require fine-tuning fewer than 0.1% of the model's parameters, yet match full-parameter fine-tuning performance. Crucially, alignment strength varies monotonically with the interpolation coefficient, demonstrating that efficient, flexible, and controllable personalized alignment is both theoretically sound and practically viable.
📝 Abstract
Post-training alignment has increasingly become a crucial factor in enhancing the usability of language models (LMs). However, the desired strength of alignment varies with individual preferences. This paper proposes a method that incorporates alignment control into a single model, referred to as CLM. The approach adds one identity layer before the first transformer layer and performs preference learning only on this layer, mapping unaligned input token embeddings into the aligned space. Experimental results demonstrate that this efficient fine-tuning method performs comparably to full fine-tuning. During inference, the input embeddings are processed through both the aligned and unaligned paths, whose outputs are merged via an interpolation coefficient. By varying this coefficient, alignment strength exhibits clear interpolation and extrapolation behavior.
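The mechanism described above can be sketched in PyTorch. This is a hypothetical illustration, not the authors' exact architecture: `IdentityAlignmentLayer` stands in for the paper's identity layer, zero-initialized so it starts as an exact identity map, and `alpha` plays the role of the interpolation coefficient (`alpha = 0` recovers the unaligned base embeddings, `0 < alpha < 1` interpolates, `alpha > 1` extrapolates).

```python
import torch
import torch.nn as nn

class IdentityAlignmentLayer(nn.Module):
    """Hypothetical sketch of the identity layer: a residual block
    initialized to the identity map, trained to shift unaligned token
    embeddings toward the preference-aligned embedding space."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        # Zero-init so that, before training, the layer is an exact identity.
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, x: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
        # Merge the unaligned embeddings x with the learned alignment shift.
        # alpha = 0: unaligned; 0 < alpha < 1: interpolation; alpha > 1: extrapolation.
        return x + alpha * self.proj(x)

# Usage: apply the layer to token embeddings before the first transformer layer.
embeddings = torch.randn(2, 5, 768)            # (batch, seq_len, hidden)
layer = IdentityAlignmentLayer(hidden_size=768)
aligned = layer(embeddings, alpha=0.5)          # half-strength alignment
```

Because only this single layer is trained while the base LLM stays frozen, the trainable parameter count is one `hidden_size × hidden_size` matrix plus a bias, which is consistent with the sub-0.1% figure reported above.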