Evaluating the Prompt Steerability of Large Language Models

📅 2024-11-19
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study investigates how controllable large language models (LLMs) are under prompt-based steering of values and personas, with the aim of supporting AI systems that represent pluralistic value systems and cultures. To this end, the authors introduce a benchmark designed for prompt steerability: steerability is formally defined in terms of shifts in a model's joint behavioral distribution from its baseline, and steerability indices quantify how far a model's persona can be shifted as a function of steering effort. Applying the benchmark across persona dimensions reveals systematic limitations in many current LLMs, including skew in baseline behavior and asymmetric steerability across dimensions and directions. The authors release an open-source implementation enabling cross-model assessment of steering capability. This work provides both a formal definition and a practical benchmark for developing controllable, value-pluralistic AI systems.

๐Ÿ“ Abstract
Building pluralistic AI requires designing models that are able to be shaped to represent a wide range of value systems and cultures. Achieving this requires first being able to evaluate the degree to which a given model is capable of reflecting various personas. To this end, we propose a benchmark for evaluating the steerability of model personas as a function of prompting. Our design is based on a formal definition of prompt steerability, which analyzes the degree to which a model's joint behavioral distribution can be shifted from its baseline. By defining steerability indices and inspecting how these indices change as a function of steering effort, we can estimate the steerability of a model across various persona dimensions and directions. Our benchmark reveals that the steerability of many current models is limited -- due to both a skew in their baseline behavior and an asymmetry in their steerability across many persona dimensions. We release an implementation of our benchmark at https://github.com/IBM/prompt-steering.
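The abstract defines steerability as the degree to which prompting can shift a model's behavioral distribution from its baseline, measured per persona dimension and per direction. A minimal sketch of that idea, assuming a simple mean-shift formulation normalized by the headroom left by the baseline (the paper's actual indices operate on joint behavioral distributions; the function name and toy scores below are hypothetical):

```python
# Hypothetical sketch of a prompt-steerability index (not the paper's exact
# metric): how much of the achievable shift away from baseline does steering
# actually realize, in a given direction?

def mean(xs):
    return sum(xs) / len(xs)

def steerability_index(baseline, steered, direction):
    """Fraction of the maximum possible shift realized in `direction`.

    baseline, steered: samples of a behavioral score in [0, 1]
    direction: +1 (steer toward 1) or -1 (steer toward 0)
    """
    b, s = mean(baseline), mean(steered)
    target = 1.0 if direction > 0 else 0.0
    max_shift = abs(target - b)  # headroom left by the (possibly skewed) baseline
    if max_shift == 0:
        return 0.0
    achieved = max(0.0, (s - b) * direction)  # shift realized in the steered direction
    return min(1.0, achieved / max_shift)

# A baseline skewed toward one end leaves little headroom in that direction,
# which is one way the asymmetric steerability the benchmark reports can arise.
baseline = [0.8, 0.9, 0.85]        # model already leans positive on this dimension
steered_up = [0.9, 0.95, 0.92]
steered_down = [0.6, 0.55, 0.65]
print(steerability_index(baseline, steered_up, +1))
print(steerability_index(baseline, steered_down, -1))
```

Inspecting how such an index changes as steering effort (e.g., the number of steering examples in the prompt) increases is what lets the benchmark characterize steerability per dimension and direction.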
Problem

Research questions and friction points this paper addresses.

Evaluate how steerable large language models are via prompting.
Assess models' ability to reflect a diverse range of personas.
Develop a benchmark for evaluating model steerability.
Innovation

Methods, ideas, and system contributions that make the work stand out.

A benchmark for evaluating persona steerability under prompting
A formal definition of prompt steerability based on shifts in a model's joint behavioral distribution
Steerability indices measured across persona dimensions and directions
🔎 Similar Papers
No similar papers found.