Adversarial Alignment: Ensuring Value Consistency in Large Language Models for Sensitive Domains

📅 2026-01-19
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the pervasive issue of value inconsistencies and biases exhibited by large language models (LLMs) in sensitive domains such as race, society, and politics. To mitigate this, the authors propose an adversarial alignment framework featuring a novel Attacker-Actor-Critic tripartite architecture: the Attacker generates contentious queries, the Actor produces value-aligned responses, and the Critic filters out low-quality outputs. The framework is optimized through a combination of continual pretraining, instruction tuning, and adversarial training. Additionally, the study introduces the first bilingual (Chinese–English) benchmark dataset for evaluating value alignment in sensitive contexts. Experimental results demonstrate that the resulting model, VC-LLM, significantly outperforms prevailing LLMs in both languages, achieving markedly improved value consistency in sensitive scenarios.
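The Attacker-Actor-Critic loop described in the summary can be pictured as an iterative data-generation pipeline: the Attacker probes, the Actor answers, and the Critic gates which pairs survive for training. The sketch below illustrates that control flow only; every function body, name, and the acceptance threshold are illustrative assumptions, not the paper's actual implementation (which uses LLMs for all three roles).

```python
# Minimal sketch of an Attacker-Actor-Critic adversarial data loop.
# All three roles are stubbed with toy functions; in the paper each role
# would be played by an LLM. Names and threshold are assumptions.

def attacker(topic: str) -> str:
    """Attacker: generate a contentious query on a sensitive topic (stub)."""
    return f"What is your stance on {topic}?"

def actor(query: str) -> str:
    """Actor: produce a value-aligned response to the query (stub)."""
    return f"On '{query}', a balanced, value-consistent answer avoids bias."

def critic(query: str, response: str) -> float:
    """Critic: score response quality / value consistency in [0, 1] (stub)."""
    return 1.0 if "value-consistent" in response else 0.0

def adversarial_round(topics, threshold=0.5):
    """One round: keep only (query, response) pairs the Critic accepts,
    yielding a filtered dataset for the next alignment-training step."""
    accepted = []
    for topic in topics:
        query = attacker(topic)
        response = actor(query)
        if critic(query, response) >= threshold:
            accepted.append((query, response))
    return accepted
```

In a real pipeline, the accepted pairs would feed the instruction-tuning / adversarial-training stage, and the Attacker would be updated to find queries the Actor still handles poorly.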

๐Ÿ“ Abstract
With the wide application of large language models (LLMs), the problems of bias and value inconsistency in sensitive domains have gradually emerged, especially in terms of race, society, and politics. In this paper, we propose an adversarial alignment framework, which enhances the value consistency of the model in sensitive domains through continued pre-training, instruction fine-tuning, and adversarial training. In adversarial training, we use the Attacker to generate controversial queries, the Actor to generate responses with value consistency, and the Critic to filter responses and ensure their quality. Furthermore, we train a Value-Consistent Large Language Model, VC-LLM, for sensitive domains, and construct a bilingual evaluation dataset in Chinese and English. The experimental results show that VC-LLM performs better than existing mainstream models in both Chinese and English tests, verifying the effectiveness of the method. Warning: This paper contains examples of LLM outputs that are offensive or harmful in nature.
Problem

Research questions and friction points this paper is trying to address.

bias
value inconsistency
large language models
sensitive domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial Alignment
Value Consistency
Large Language Models
Sensitive Domains
Adversarial Training
Yuan Gao
School of Information Engineering, Minzu University of China, Beijing 100081, China; National Language Resource Monitoring and Research Center of Minority Languages, Beijing 100081, China
Zhigang Liu
School of Information Engineering, Minzu University of China, Beijing 100081, China; National Language Resource Monitoring and Research Center of Minority Languages, Beijing 100081, China
Xinyu Yao
School of Information Engineering, Minzu University of China, Beijing 100081, China; National Language Resource Monitoring and Research Center of Minority Languages, Beijing 100081, China
Bo Chen
Minzu University of China
Natural Language Processing · Semantic Parsing · LLMs · Stance Detection
Xiaobing Zhao
School of Information Engineering, Minzu University of China, Beijing 100081, China; National Language Resource Monitoring and Research Center of Minority Languages, Beijing 100081, China