Improving and Assessing the Fidelity of Large Language Models Alignment to Online Communities

📅 2024-08-18
🏛️ arXiv.org
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This study addresses the challenge of value alignment between large language models (LLMs) and specific online communities, particularly those centered on diet and body image, focusing on fidelity assessment along critical dimensions: linguistic authenticity, affective valence, toxicity, and harmfulness. We propose the first multidimensional alignment evaluation framework to integrate clinical psychometric instruments (e.g., the Eating Disorder Examination Questionnaire) into LLM safety assessment, bridging psychological measurement theory and AI safety. Our methodology combines community-specific instruction fine-tuning on curated, domain-adapted training data with multi-granularity linguistic evaluation metrics. Experimental results demonstrate that the framework surfaces pathological beliefs expressed in model outputs and discriminates between communities with varying levels of eating disorder risk, achieving high discriminant validity. The approach shows promise for automated content moderation and for public health interventions targeting disordered eating and body image concerns.
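The paper's community-specific instruction fine-tuning is not detailed on this card, but a minimal sketch of the general recipe, assuming Hugging Face `transformers`, `peft`, and `datasets` with LoRA adapters, a placeholder base model, and a hypothetical `community_posts.jsonl` file of community posts, might look like this:

```python
# Minimal sketch of community-specific instruction fine-tuning with LoRA.
# Assumptions (not from the paper): the base model, the instruction format,
# and the community_posts.jsonl file are all placeholders for illustration.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder; the paper's base model may differ

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Wrap the base model with low-rank adapters so only a small set of weights is trained.
lora = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16, lora_dropout=0.05)
model = get_peft_model(model, lora)

def to_instruction(example):
    # Hypothetical instruction format: condition generation on the community name.
    prompt = (f"Write a post as a member of the online community "
              f"'{example['community']}'.\n\n{example['post']}")
    return tokenizer(prompt, truncation=True, max_length=512)

# Hypothetical data file with one {"community": ..., "post": ...} row per example.
dataset = load_dataset("json", data_files="community_posts.jsonl", split="train")
dataset = dataset.map(to_instruction, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="aligned-lm", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```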

📝 Abstract
Large language models (LLMs) have shown promise in representing individuals and communities, offering new ways to study complex social dynamics. However, effectively aligning LLMs with specific human groups and systematically assessing the fidelity of the alignment remains a challenge. This paper presents a robust framework for aligning LLMs with online communities via instruction-tuning and comprehensively evaluating alignment across various aspects of language, including authenticity, emotional tone, toxicity, and harm. We demonstrate the utility of our approach by applying it to online communities centered on dieting and body image. We administer an eating disorder psychometric test to the aligned LLMs to reveal unhealthy beliefs and successfully differentiate communities with varying levels of eating disorder risk. Our results highlight the potential of LLMs in automated moderation and broader applications in public health and social science research.
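As a rough illustration of the psychometric step described in the abstract, the sketch below prompts an aligned model with questionnaire items rated on a 0-6 scale (the EDE-Q's scale) and averages the parsed ratings. The item texts, prompt wording, and model path are placeholders, not the instrument's actual items or the paper's protocol:

```python
# Sketch: administering a 0-6 Likert questionnaire to an aligned LM and scoring it.
# The items below are paraphrased placeholders, not the actual EDE-Q items.
import re
from transformers import pipeline

generator = pipeline("text-generation", model="aligned-lm")  # hypothetical fine-tuned model

ITEMS = [
    "Over the past 28 days, have you deliberately limited how much you eat?",
    "Over the past 28 days, has your weight influenced how you judge yourself?",
]

def administer(items, n_samples=5):
    """Ask the model to rate each item from 0 to 6, then average parsed ratings."""
    item_scores = []
    for item in items:
        prompt = (f"{item}\nAnswer with a single number from 0 (not at all) "
                  f"to 6 (markedly) that reflects your experience.\nAnswer:")
        ratings = []
        for out in generator(prompt, max_new_tokens=5, do_sample=True,
                             num_return_sequences=n_samples):
            # generated_text includes the prompt; search only the continuation.
            match = re.search(r"[0-6]", out["generated_text"][len(prompt):])
            if match:
                ratings.append(int(match.group()))
        if ratings:
            item_scores.append(sum(ratings) / len(ratings))
    return sum(item_scores) / len(item_scores) if item_scores else None

print(administer(ITEMS))
```

Comparing such scores across models aligned to different communities is what lets the approach differentiate communities by eating disorder risk.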
Problem

Research questions and friction points this paper is trying to address.

Aligning LLMs with online communities
Assessing language model fidelity
Evaluating health risks in communities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Instruction-tuning for community alignment
Comprehensive language aspect evaluation (see the sketch after this list)
Psychometric tests for community differentiation
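As one concrete reading of the language aspect evaluation, a sketch using off-the-shelf scorers, assuming the `detoxify` and `vaderSentiment` packages (the paper's exact metric choices for each dimension are not listed on this card), might combine per-dimension scores like this:

```python
# Sketch: scoring generated text along some of the alignment dimensions.
# Assumes the detoxify and vaderSentiment packages; the paper's actual
# metrics for authenticity and harm are not specified on this card.
from detoxify import Detoxify
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

toxicity_model = Detoxify("original")
sentiment_analyzer = SentimentIntensityAnalyzer()

def evaluate_text(text):
    """Return per-dimension scores for one generated passage."""
    toxicity = toxicity_model.predict(text)               # dict: toxicity, insult, ...
    sentiment = sentiment_analyzer.polarity_scores(text)  # dict: neg, neu, pos, compound
    return {
        "toxicity": toxicity["toxicity"],
        "affective_valence": sentiment["compound"],  # -1 (negative) to +1 (positive)
    }

sample = "I skipped every meal today and I feel great about it."
print(evaluate_text(sample))
```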