ConTrans: Weak-to-Strong Alignment Engineering via Concept Transplantation

📅 2024-05-22
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) suffer from high alignment costs, poor generalization, and limited transferability of alignment capabilities across models. Method: This paper proposes “concept transplantation,” a novel framework that enables value-concept transfer from weakly aligned small models to strongly capable yet unaligned large models. It extracts concept vectors via representation engineering and injects alignment knowledge into the target model through affine adaptation and residual stream injection—without full-parameter fine-tuning. The approach supports cross-family and intra-family transfers across multiple scales (e.g., 7B → 13B/70B). Contribution/Results: Experiments across diverse LLM families demonstrate successful alignment knowledge transfer; the method outperforms instruction-tuning baselines on truthfulness metrics, validating its scalability across model sizes and architectural families. This establishes the first effective paradigm for zero-shot, parameter-efficient alignment generalization.

📝 Abstract
Ensuring large language models (LLMs) behave consistently with human goals, values, and intentions is crucial for their safety, yet computationally expensive. To reduce the computational cost of alignment training of LLMs, especially for those with a huge number of parameters, and to reutilize learned value alignment, we propose ConTrans, a novel framework that enables weak-to-strong alignment transfer via concept transplantation. From the perspective of representation engineering, ConTrans refines concept vectors in value alignment from a source LLM (usually a weak yet aligned LLM). The refined concept vectors are then reformulated to adapt to the target LLM (usually a strong yet unaligned base LLM) via affine transformation. In the third step, ConTrans transplants the reformulated concept vectors into the residual stream of the target LLM. Experiments demonstrate the successful transplantation of a wide range of aligned concepts from 7B models to 13B and 70B models across multiple LLMs and LLM families. Remarkably, ConTrans even surpasses instruction-tuned models in terms of truthfulness. Experiment results validate the effectiveness of both inter-LLM-family and intra-LLM-family concept transplantation. Our work successfully demonstrates an alternative way to achieve weak-to-strong alignment generalization and control.
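The abstract's three-step pipeline can be sketched in toy form: (1) refine a concept vector in the source model's representation space, (2) map it into the target model's space with an affine transformation, (3) add it to the target's residual stream. The sketch below is purely illustrative, not the paper's implementation: the dimensions, the mean-difference extraction recipe, the random placeholder affine map, and the `alpha` scaling coefficient are all assumptions standing in for the learned components described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden sizes for a weak source model and a strong target model.
D_SRC, D_TGT = 8, 16

# Step 1: refine a concept vector in the source model's space as the mean
# difference between hidden states on aligned vs. unaligned prompts
# (a common representation-engineering recipe; toy activations here).
h_aligned = rng.normal(size=(32, D_SRC))    # activations on aligned prompts
h_unaligned = rng.normal(size=(32, D_SRC))  # activations on unaligned prompts
concept_src = h_aligned.mean(axis=0) - h_unaligned.mean(axis=0)

# Step 2: reformulate the vector for the target model via an affine map
# W x + b between the two representation spaces. In the paper this map is
# fitted; here W and b are random placeholders.
W = rng.normal(size=(D_TGT, D_SRC)) / np.sqrt(D_SRC)
b = np.zeros(D_TGT)
concept_tgt = W @ concept_src + b

# Step 3: transplant the concept by adding it to the target model's
# residual stream at a chosen layer, scaled by a strength coefficient.
def inject(residual_stream: np.ndarray, concept: np.ndarray,
           alpha: float = 1.0) -> np.ndarray:
    """Add the transplanted concept vector at every token position."""
    return residual_stream + alpha * concept

tokens = rng.normal(size=(5, D_TGT))  # toy residual stream: 5 token positions
steered = inject(tokens, concept_tgt, alpha=0.5)
print(steered.shape)  # (5, 16)
```

Note that no target-model parameters change; the intervention lives entirely in the forward pass, which is what makes the transfer parameter-efficient.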
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Alignment Problem
Human Intention Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

ConTrans Method
Alignment Transfer
Large Language Model Guidance
🔎 Similar Papers

👥 Authors
Weilong Dong
College of Intelligence and Computing, Tianjin University
Xinwei Wu
College of Intelligence and Computing, Tianjin University
Renren Jin
College of Intelligence and Computing, Tianjin University
Natural Language Processing
Shaoyang Xu
School of New Media and Communication, Tianjin University
Deyi Xiong
Professor, College of Intelligence and Computing, Tianjin University, China
Natural Language Processing
Large Language Models
AI4Science