Can Revealed Preferences Clarify LLM Alignment and Steering?

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the opacity of large language models' preferences in high-stakes decisions under uncertainty, which impedes rigorous evaluation of their alignment and steerability. The work applies revealed-preference analysis to language models through an empirical pipeline built on discrete choice models: it elicits the model's probability distribution over the unknowns together with the choice it would make, then fits a discrete choice model to recover the cost function that best rationalizes those choices. Evaluation across four medical diagnosis domains and multiple frontier and open-source models shows that while many models exhibit a nontrivial degree of internal coherence, they have significant weaknesses in faithfully reporting their own preferences and in adopting a user-specified cost function when prompted. The approach offers a quantitative way to assess and steer the alignment of large language models.

📝 Abstract

LLMs are increasingly used to make or support high-stakes decisions under uncertainty, where alignment depends not only on factual accuracy but on how models weigh tradeoffs between different outcomes. We present an empirical pipeline for estimating the implied preferences that an LLM's observed choices optimize: we elicit the model's probability distribution over unknowns along with the choice it would make for the decision task and then fit a discrete choice model to recover the cost function that best rationalizes the model's decisions. We show how this revealed-preference description allows rigorous evaluation of whether models behave in a consistently goal-directed way, whether they can verbalize a description of their objectives which matches their revealed decision policy, and whether prompting can reliably steer those policies to implement a user-specified cost function. We apply this evaluation across four medical diagnosis domains and multiple frontier and open-source models. We find that while many models have a nontrivial degree of internal coherence, they also have significant weaknesses in faithfully reporting or adopting preferences in response to user direction.
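
To make the "fit a discrete choice model to recover the cost function" step concrete, here is a minimal sketch under strong simplifying assumptions that are not taken from the paper: a binary unknown (disease or no disease), a binary decision (treat or do not treat), zero cost for correct decisions, and a logit choice rule. The names `fit_logit_costs`, `c_fn`, `c_fp`, and `beta` are hypothetical, and the data are synthetic stand-ins for elicited (probability, choice) pairs. In this setup only the ratio of the false-negative cost to the false-positive cost is identified, because the overall cost scale is absorbed by the choice-noise parameter.

```python
import numpy as np


def fit_logit_costs(p_disease, chose_treat, n_iter=50):
    """Fit a binary logit choice model by Newton's method.

    Assumed (hypothetical) setup: the decision-maker treats with probability
    sigmoid(a * p - b * (1 - p)), where p is its stated disease probability,
    a = beta * c_fn and b = beta * c_fp. Only the ratio a / b is identified.
    """
    p = np.asarray(p_disease, dtype=float)
    y = np.asarray(chose_treat, dtype=float)
    # Columns: expected-cost weight of "do not treat" (p) and of "treat" (1 - p).
    X = np.column_stack([p, -(1.0 - p)])
    w = np.zeros(2)
    for _ in range(n_iter):
        z = np.clip(X @ w, -30.0, 30.0)  # clip to avoid overflow in exp
        mu = 1.0 / (1.0 + np.exp(-z))
        grad = X.T @ (y - mu)
        # Negative Hessian of the log-likelihood, with a tiny ridge for stability.
        hess = (X * (mu * (1.0 - mu))[:, None]).T @ X + 1e-8 * np.eye(2)
        w = w + np.linalg.solve(hess, grad)
    return w[0], w[1]


# Synthetic stand-in for elicited (probability, choice) pairs.
rng = np.random.default_rng(0)
true_c_fn, true_c_fp, beta = 4.0, 1.0, 2.0
p = rng.uniform(0.0, 1.0, size=20000)
utility_gap = beta * (true_c_fn * p - true_c_fp * (1.0 - p))
treat = rng.uniform(size=p.size) < 1.0 / (1.0 + np.exp(-utility_gap))

a_hat, b_hat = fit_logit_costs(p, treat)
print("estimated c_fn / c_fp:", a_hat / b_hat)
```

With a recovered cost ratio in hand, the comparisons the abstract describes become straightforward in principle: the ratio can be set against the one a model states in words, or against the one a user specifies in the prompt. The paper's actual choice model and cost parameterization may differ from this sketch.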

Problem

Research questions and friction points this paper is trying to address.

LLM alignment

revealed preferences

decision-making under uncertainty

preference elicitation

model steering

Innovation

Methods, ideas, or system contributions that make the work stand out.

revealed preferences

LLM alignment

discrete choice modeling