🤖 AI Summary
This paper addresses the lack of evaluation frameworks for assessing the steerability of large language models (LLMs) across diverse community contexts. Methodologically, it introduces the first community-specific instruction-alignment benchmark: leveraging 30 contrasting Reddit subreddit pairs spanning 19 domains, it constructs real-world, corpus-driven instruction–response pairs; formally defines and quantifies *community-sensitive steerability*; and incorporates silver-labeled multiple-choice questions, adversarial instruction design, cross-group consistency scoring, and perturbation robustness testing. The key contribution is a fine-grained, scalable diagnostic framework for community-aware alignment. Evaluation across 13 state-of-the-art LLMs shows that the best-performing model reaches only about 65% accuracy, well below human expert performance (81%), exposing systematic deficiencies in cultural awareness, ideological adaptability, and resistance to manipulative or misleading instructions.
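To make the silver-label evaluation concrete, here is a minimal sketch of how community-steered multiple-choice scoring could work, assuming each item pairs a subreddit persona with a question and a corpus-derived silver label. The field names and `model_fn` interface are hypothetical illustrations, not Steer-Bench's actual schema or API.

```python
from collections import defaultdict

def steer_accuracy(items, model_fn):
    """Per-subreddit accuracy of a model on community-steered MCQs.

    `items` is a list of dicts with hypothetical fields: 'subreddit',
    'question' (an MCQ with lettered options), and 'silver_label'.
    `model_fn(system_prompt, question)` returns the chosen option letter.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        # Steer the model toward the target community via a persona prompt.
        persona = f"Answer as a typical member of r/{item['subreddit']}."
        answer = model_fn(persona, item["question"])
        total[item["subreddit"]] += 1
        if answer.strip().upper() == item["silver_label"].upper():
            correct[item["subreddit"]] += 1
    return {sub: correct[sub] / total[sub] for sub in total}

# Toy usage with a stub model that always answers "A".
items = [{"subreddit": "AskScience",
          "question": "Q: ... (A) ... (B) ...",
          "silver_label": "A"}]
print(steer_accuracy(items, lambda persona, q: "A"))  # {'AskScience': 1.0}
```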
📝 Abstract
Steerability, or the ability of large language models (LLMs) to adapt outputs to align with diverse community-specific norms, perspectives, and communication styles, is critical for real-world applications but remains under-evaluated. We introduce Steer-Bench, a benchmark for assessing population-specific steering using contrasting Reddit communities. Covering 30 contrasting subreddit pairs across 19 domains, Steer-Bench includes over 10,000 instruction-response pairs and 5,500 validated multiple-choice questions with corresponding silver labels to test alignment with diverse community norms. Our evaluation of 13 popular LLMs using Steer-Bench reveals that while human experts achieve 81% accuracy against the silver labels, the best-performing models reach only around 65% accuracy depending on the domain and configuration. Some models lag behind human-level alignment by over 15 percentage points, highlighting significant gaps in community-sensitive steerability. Steer-Bench thus provides a systematic way to assess how effectively LLMs understand community-specific instructions, how resilient they are to adversarial steering attempts, and how accurately they represent diverse cultural and ideological perspectives.
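The cross-group consistency scoring mentioned above could be sketched as follows: pose the same question steered toward each side of a contrasting pair and count the model as consistent only when it matches both communities' silver labels. The scoring rule and field names here are assumptions for illustration, not the paper's exact protocol.

```python
def cross_group_consistency(pairs, model_fn):
    """Fraction of question pairs where the model tracks both communities.

    Each pair (hypothetical schema) holds one question plus the silver
    label for each side of a contrasting subreddit pair; the model is
    consistent on a pair only if it matches both community labels when
    steered toward each in turn.
    """
    hits = 0
    for p in pairs:
        ok = True
        for side in ("a", "b"):
            persona = f"Answer as a typical member of r/{p['subreddit_' + side]}."
            ok &= model_fn(persona, p["question"]) == p["label_" + side]
        hits += ok
    return hits / len(pairs) if pairs else 0.0
```

Requiring both sides to be correct distinguishes genuine steering from a model that gives one fixed majority answer regardless of the persona it is asked to adopt.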