Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI

📅 2025-11-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Despite widespread industrial adoption of persona shaping ("character training") for large language models, systematic academic investigation remains lacking. This paper introduces the first open-source, controllable character-training framework for AI assistants, built on Constitutional AI and a synthetic introspective data pipeline combined with supervised fine-tuning to enable multi-persona customization. The method significantly outperforms system prompting and activation steering, producing stable, authentic, and interference-robust personas across 11 example personas (e.g., humorous, deeply caring, malevolent) without compromising general capabilities. Key contributions: (1) the first end-to-end reproducible open-source implementation of character training; (2) a revealed-preference evaluation method for measuring persona consistency; and (3) empirical validation of cross-model generalizability across three popular open-weights LLMs. Results demonstrate substantial improvements in interaction quality, intent alignment, and value consistency.

📝 Abstract
The character of the "AI assistant" persona generated by modern chatbot large language models influences both surface-level behavior and apparent values, beliefs, and ethics. These all affect interaction quality, perceived intelligence, and alignment with both developer and user intentions. The shaping of this persona, known as character training, is a critical component of industry post-training, yet remains effectively unstudied in the academic literature. We introduce the first open implementation of character training, leveraging Constitutional AI and a new data pipeline using synthetic introspective data to shape the assistant persona in a more effective and controlled manner than alternatives such as constraining system prompts or activation steering. Specifically, we fine-tune three popular open-weights models using 11 example personas, such as humorous, deeply caring, or even malevolent. To track the effects of our approach, we introduce a method which analyzes revealed preferences, uncovering clear and holistic changes in character. We find these changes are more robust to adversarial prompting than the above two alternatives, while also leading to more coherent and realistic generations. Finally, we demonstrate this fine-tuning has little to no effect on general capabilities as measured by common benchmarks. We describe and open-source our full post-training method, the implementation of which can be found at https://github.com/maiush/OpenCharacterTraining.
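The abstract's evaluation idea, analyzing revealed preferences, can be illustrated with a small sketch: the assistant is shown forced-choice prompts whose options express opposing trait behaviors, and the tally of its choices forms a preference profile. The probe prompts, trait names, and `choose`/`preference_profile` helpers below are illustrative assumptions, not the paper's actual protocol.

```python
from collections import Counter

PROBES = [
    # (trait, option_a, option_b) -- choosing A counts toward the trait
    ("humor", "Open with a light joke.", "Answer strictly formally."),
    ("empathy", "Acknowledge the user's frustration first.", "Go straight to the fix."),
    ("humor", "Use a playful analogy.", "Keep the tone neutral."),
]

def choose(model, option_a, option_b):
    """Ask the model to pick between two behaviors; expects 'A' or 'B' back."""
    prompt = f"Pick A or B.\nA: {option_a}\nB: {option_b}\nAnswer:"
    return model(prompt)

def preference_profile(model):
    """Fraction of probes, per trait, where the trait-expressing option won."""
    wins, totals = Counter(), Counter()
    for trait, a, b in PROBES:
        totals[trait] += 1
        if choose(model, a, b) == "A":
            wins[trait] += 1
    return {t: wins[t] / totals[t] for t in totals}

# Stub standing in for a character-trained assistant that always
# prefers the trait-expressing option.
humorous_model = lambda prompt: "A"
profile = preference_profile(humorous_model)  # e.g. {"humor": 1.0, "empathy": 1.0}
```

Comparing such profiles before and after fine-tuning is one way to quantify the "clear and holistic changes in character" the abstract describes, without relying on the model's self-reports.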
Problem

Research questions and friction points this paper is trying to address.

Shaping AI assistant persona through character training methods
Improving robustness against adversarial prompting in AI assistants
Maintaining general capabilities while modifying assistant characteristics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging Constitutional AI for character training
Using synthetic introspective data pipeline
Fine-tuning models with multiple persona examples
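The synthetic introspective data step listed above can be sketched as follows: a teacher model is prompted, in character, to reflect on a persona constitution, and the resulting (prompt, reflection) pairs become chat-format supervised fine-tuning examples. The constitution text, introspection prompts, and `make_sft_examples` helper are illustrative assumptions, not the released pipeline.

```python
# Persona "constitution": short trait statements the assistant should embody.
CONSTITUTION = [
    "I enjoy finding gentle humor in everyday situations.",
    "I care deeply about the wellbeing of the people I talk to.",
]

# Introspective questions the teacher model answers while in character.
INTROSPECTION_PROMPTS = [
    "How would you describe your sense of humor?",
    "What do you value most in a conversation?",
]

def make_sft_examples(generate, constitution, prompts):
    """Build chat-format SFT records from in-character introspective generations."""
    system = "You are an assistant with this character:\n" + "\n".join(constitution)
    examples = []
    for question in prompts:
        answer = generate(system, question)  # teacher LLM call, in character
        examples.append({
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
        })
    return examples

# Stub generator standing in for a real teacher LLM.
stub = lambda system, question: f"Reflecting on my character: {question.lower()}"
data = make_sft_examples(stub, CONSTITUTION, INTROSPECTION_PROMPTS)
```

Records in this shape can feed a standard SFT loop; the key design choice the paper highlights is that the training signal comes from the model reflecting on its own persona rather than from externally written dialogues.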