🤖 AI Summary
Aligning large language models (LLMs) is computationally expensive, and alignment learned by one model does not readily transfer to others. Method: This paper proposes "concept transplantation" (ConTrans), a framework that transfers value concepts from weak yet aligned small models to strong yet unaligned large models. It extracts concept vectors via representation engineering, adapts them to the target model through an affine transformation, and injects them into the target model's residual stream, without full-parameter fine-tuning. The approach supports both cross-family and intra-family transfer across multiple scales (e.g., 7B → 13B/70B). Contribution/Results: Experiments across several LLM families demonstrate successful transfer of aligned concepts; the method even surpasses instruction-tuned models on truthfulness, supporting its applicability across model sizes and families. The work offers an alternative route to weak-to-strong alignment generalization and control.
📝 Abstract
Ensuring that large language models (LLMs) behave consistently with human goals, values, and intentions is crucial for their safety, yet computationally expensive. To reduce the computational cost of alignment training, especially for LLMs with a huge number of parameters, and to reuse learned value alignment, we propose ConTrans, a novel framework that enables weak-to-strong alignment transfer via concept transplantation. From the perspective of representation engineering, ConTrans first refines concept vectors for value alignment from a source LLM (usually a weak yet aligned LLM). The refined concept vectors are then reformulated to adapt to the target LLM (usually a strong yet unaligned base LLM) via an affine transformation. Finally, ConTrans transplants the reformulated concept vectors into the residual stream of the target LLM. Experiments demonstrate the successful transplantation of a wide range of aligned concepts from 7B models to 13B and 70B models across multiple LLMs and LLM families. Remarkably, ConTrans even surpasses instruction-tuned models in terms of truthfulness. Experimental results validate the effectiveness of both inter-LLM-family and intra-LLM-family concept transplantation. Our work demonstrates an alternative way to achieve weak-to-strong alignment generalization and control.
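The three steps above (refine, reformulate, transplant) can be illustrated with a minimal NumPy sketch on synthetic data. The abstract does not specify how each step is implemented, so everything here is an assumption: the concept vector is taken as a difference of mean hidden states, the affine map is fitted by least squares on paired hidden states, and the names `concept_vector`, `fit_affine`, `transplant`, and `alpha` are hypothetical. The paper's actual formulation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d_src, d_tgt, n = 8, 12, 64  # toy hidden sizes and sample count (hypothetical)

# Stand-ins for source-model hidden states on concept-positive
# and concept-negative prompts (random data, not real activations).
src_pos = rng.normal(size=(n, d_src)) + 1.0
src_neg = rng.normal(size=(n, d_src))

def concept_vector(pos, neg):
    """Step 1 (assumed): concept vector as the difference of class means."""
    return pos.mean(axis=0) - neg.mean(axis=0)

def fit_affine(h_src, h_tgt):
    """Step 2 (assumed): least-squares fit of h_tgt ~ h_src @ W + b."""
    X = np.hstack([h_src, np.ones((h_src.shape[0], 1))])  # bias column
    coef, *_ = np.linalg.lstsq(X, h_tgt, rcond=None)
    return coef[:-1], coef[-1]  # W: (d_src, d_tgt), b: (d_tgt,)

def transplant(resid, v, alpha=1.0):
    """Step 3: add the scaled concept vector to every residual-stream position."""
    return resid + alpha * v

v_src = concept_vector(src_pos, src_neg)

# Toy paired hidden states: the "target model" here is an exact affine
# image of the source states, so the fit can be checked against true_W.
h_src = np.vstack([src_pos, src_neg])
true_W = rng.normal(size=(d_src, d_tgt))
h_tgt = h_src @ true_W + 0.5

W, b = fit_affine(h_src, h_tgt)

# The vector is a difference of two states, so the bias cancels;
# this sketch therefore maps it with the linear part only.
v_tgt = v_src @ W

resid = rng.normal(size=(5, d_tgt))  # 5 token positions in the target model
steered = transplant(resid, v_tgt, alpha=2.0)
```

In a real setting, `h_src` and `h_tgt` would be hidden states collected from the two models on the same inputs, and `transplant` would run inside the target model's forward pass at a chosen layer; the strength `alpha` and the layer choice are tuning decisions that this sketch leaves open.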