What's Producible May Not Be Reachable: Measuring the Steerability of Generative Models

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation of *steerability*—the ability of generative models to produce precise target outputs via user interaction—by formally distinguishing and decoupling it from *producibility* (i.e., whether a model can generate a given content class). We propose a goal-driven, user-reproduction evaluation paradigm and construct a large-scale benchmark spanning text-to-image generation and large language models. Our framework integrates human evaluation, targeted sampling protocols, controlled interactive experiments, and reinforcement learning (RL) optimization specifically designed for steerability. Empirical analysis reveals that state-of-the-art models exhibit consistently weak steerability. Applying our framework, we optimize image generation guidance mechanisms via RL, achieving over 2× improvement in steerability performance on the benchmark. Our core contributions are: (1) establishing the first formal evaluation paradigm for steerability, and (2) providing the first reproducible, scalable, and extensible benchmarking suite for this critical capability.

📝 Abstract
How should we evaluate the quality of generative models? Many existing metrics focus on a model's producibility, i.e., the quality and breadth of outputs it can generate. However, the actual value of a generative model stems not just from what it can produce but from whether a user with a specific goal can obtain an output that satisfies that goal. We refer to this property as steerability. In this paper, we first introduce a mathematical framework for evaluating steerability independently from producibility. Steerability is more challenging to evaluate than producibility because it requires knowing a user's goals. We address this issue by creating a benchmark task that relies on one key idea: sample an output from a generative model and ask users to reproduce it. We implement this benchmark in a large-scale user study of text-to-image models and large language models. Despite the ability of these models to produce high-quality outputs, they all perform poorly on steerability. This suggests that we need to focus on improving the steerability of generative models. We show such improvements are indeed possible: through reinforcement learning techniques, we create an alternative steering mechanism for image models that achieves more than 2x improvement on this benchmark.
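The reproduction benchmark described above can be sketched as a simple scoring loop: sample a goal output, collect a user's reproduction attempts, and score the best attempt against the goal. This is a minimal illustration, not the paper's implementation; the embedding-based cosine similarity and all function names here are assumptions (the paper does not specify its similarity measure in this abstract).

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def steerability_score(goal_embedding, attempt_embeddings):
    """Score a reproduction trial: how close did the user's best
    attempt get to the sampled goal output?"""
    return max(cosine_sim(goal_embedding, e) for e in attempt_embeddings)

# Hypothetical trial: the goal is sampled from the model itself, so it is
# known to be producible; steerability asks whether a user can reach it.
goal = np.array([1.0, 0.0, 0.0])
attempts = [np.array([0.0, 1.0, 0.0]),   # first prompt misses
            np.array([0.8, 0.6, 0.0])]   # later prompt gets closer
print(steerability_score(goal, attempts))
```

Sampling the goal from the model itself is the key decoupling step: any failure to reproduce it reflects limited steerability rather than limited producibility.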
Problem

Research questions and friction points this paper is trying to address.

Evaluating generative models' steerability separately from their producibility
Measuring whether users with specific goals can actually reach those goals through model outputs
Determining whether steerability can be improved, e.g., via reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing a mathematical framework for evaluating steerability independently of producibility
Creating a benchmark built on a user-reproduction task: sample a model output and ask users to reproduce it
Improving steerability more than 2x on the benchmark with an RL-trained alternative steering mechanism for image models