A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs

📅 2025-05-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This paper investigates the steerability of large language models (LLMs), i.e., their ability to reliably satisfy multi-dimensional user objectives (e.g., readability, tone). It identifies three pervasive alignment failures: coverage gaps (lack of support for rare attribute combinations), miscalibration (outputs overshooting the requested attribute targets), and cross-dimensional side effects (unintended perturbations to other attributes when modifying one). Method: the paper introduces the first text-attribute-vector-based evaluation framework for multi-dimensional steerability, formally defining and empirically characterizing miscalibration and side effects, and conducts controlled text-rewriting experiments spanning multi-objective goal modeling, quantitative attribute representation, prompt engineering, best-of-N sampling, and RLHF fine-tuning. Results: side effects are persistent and robust across mainstream LLMs, and existing interventions yield only marginal improvements. The findings indicate that current alignment strategies remain insufficient for reliable, fine-grained controllable generation.

๐Ÿ“ Abstract
Despite advances in large language models (LLMs) on reasoning and instruction-following benchmarks, it remains unclear whether they can reliably produce outputs aligned with a broad variety of user goals, a concept we refer to as steerability. The abundance of methods proposed to modify LLM behavior makes it unclear whether current LLMs are already steerable, or require further intervention. In particular, LLMs may exhibit (i) poor coverage, where rare user goals are underrepresented; (ii) miscalibration, where models overshoot requests; and (iii) side effects, where changes to one dimension of text inadvertently affect others. To systematically evaluate these failures, we introduce a framework based on a multi-dimensional goal space that models user goals and LLM outputs as vectors with dimensions corresponding to text attributes (e.g., reading difficulty). Applied to a text-rewriting task, we find that current LLMs struggle with steerability, as side effects are persistent. Interventions to improve steerability, such as prompt engineering, best-of-$N$ sampling, and reinforcement learning fine-tuning, have varying effectiveness, yet side effects remain problematic. Our findings suggest that even strong LLMs struggle with steerability, and existing alignment strategies may be insufficient. We open-source our steerability evaluation framework at https://github.com/MLD3/steerability.
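The abstract models user goals and LLM outputs as vectors in a shared goal space whose dimensions are text attributes, with miscalibration measured on the dimensions a user asked to change and side effects on those they did not. A minimal sketch of that idea in Python follows; the attribute names, the toy `attribute_scores` stub, and the exact metric definitions are illustrative assumptions, not the authors' implementation (see their repository for the real framework).

```python
import numpy as np

# Toy goal space: each text maps to a vector over these attribute dimensions.
ATTRIBUTES = ["reading_difficulty", "formality", "sentiment"]

def attribute_scores(text: str) -> np.ndarray:
    """Stub scorer mapping text to [0, 1] attribute scores.
    A real framework would use trained attribute regressors."""
    n = max(len(text), 1)
    return np.array([
        min(len(text.split()) / 20.0, 1.0),   # crude length-based difficulty
        sum(c.isupper() for c in text) / n,   # crude formality proxy
        text.count("!") / n,                  # crude sentiment proxy
    ])

def steerability_report(source: str, output: str,
                        goal: np.ndarray, targeted: np.ndarray) -> dict:
    """Compare a rewrite against a goal vector.

    targeted: boolean mask over ATTRIBUTES marking requested changes.
    Miscalibration: distance from the goal on targeted dimensions
    (captures over/undershooting). Side effect: movement on dimensions
    the user did not ask to change.
    """
    z_src, z_out = attribute_scores(source), attribute_scores(output)
    miscalibration = np.abs(z_out - goal)[targeted].mean()
    untouched = ~targeted
    side_effect = (np.abs(z_out - z_src)[untouched].mean()
                   if untouched.any() else 0.0)
    return {"miscalibration": float(miscalibration),
            "side_effect": float(side_effect)}

goal = np.array([0.3, 0.5, 0.5])            # desired attribute vector
targeted = np.array([True, False, False])   # only reading difficulty requested
report = steerability_report("original text", "rewritten text", goal, targeted)
print(report)
```

Under this framing, an intervention such as best-of-N sampling would pick the candidate rewrite minimizing miscalibration, and the persistent-side-effect finding corresponds to `side_effect` staying large even as `miscalibration` shrinks.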
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM steerability and alignment with diverse user goals
Identifying miscalibration, poor coverage, and side effects in LLMs
Assessing effectiveness of interventions to improve LLM steerability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-dimensional goal space framework
Prompt engineering interventions
Reinforcement learning fine-tuning