🤖 AI Summary
Large language models often exhibit undesirable behaviors in sensitive social contexts, such as intent misalignment and personality inconsistency, underscoring the urgent need for a systematic evaluation framework. This work proposes SteerEval, the first three-tiered controllability benchmark spanning multiple behavioral granularities. It unifies assessment across linguistic features, affective states, and personality traits through three hierarchical levels: L1 (content), L2 (style), and L3 (realization). By combining hierarchical behavioral modeling, multidimensional metrics, and a systematic comparison of mainstream steering methods, the study reveals a significant performance drop in existing approaches at finer-grained levels. These findings demonstrate both the necessity and the value of SteerEval in advancing research toward safe, interpretable, and controllable large language models.
📝 Abstract
Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express it), and L3 (how to instantiate it), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods, revealing that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior, serving as a foundation for future research.
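To make the three-level specification concrete, here is a minimal sketch of how a SteerEval-style behavior specification might be represented. All names here (`BehaviorSpec`, `prompt_suffix`, the field values) are illustrative assumptions, not the benchmark's actual schema: the point is only to show how L1 (what), L2 (how), and L3 (instantiation) compose into a single steering instruction.

```python
from dataclasses import dataclass

# Hypothetical representation of one controllability specification;
# field names and rendering format are assumptions for illustration.
@dataclass(frozen=True)
class BehaviorSpec:
    domain: str  # e.g. "language_features", "sentiment", or "personality"
    l1: str      # what to express: the high-level behavioral intent
    l2: str      # how to express it: the stylistic realization
    l3: str      # how to instantiate it: a concrete textual constraint

    def prompt_suffix(self) -> str:
        """Render the spec as a steering instruction appended to a prompt."""
        return (f"[{self.domain}] intent: {self.l1}; "
                f"style: {self.l2}; instantiation: {self.l3}")

# Example: a sentiment-domain spec spanning all three levels.
spec = BehaviorSpec(
    domain="sentiment",
    l1="express optimism",
    l2="warm, encouraging tone",
    l3="close with an upbeat one-sentence outlook",
)
print(spec.prompt_suffix())
```

Under this sketch, evaluating a steering method amounts to checking whether the generated text satisfies each level of the spec, which is where finer-grained (L3) control tends to degrade.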