AI Summary
This study addresses a limitation of current language model preference evaluations: they rely predominantly on forced-choice binary prompts, which can conflate genuine preferences with artifacts of the elicitation protocol. The authors systematically examine the correlation between stated and revealed (SvR) preferences across 24 language models under varying elicitation protocols, introducing neutral and abstention options to capture weak or uncertain preferences, alongside system prompt interventions. They compare models using Spearman's rank correlation and evaluate system prompt steering on the AIRiskDilemmas dataset. Results show that permitting neutrality and abstention when eliciting stated preferences substantially improves SvR correlation, whereas also allowing abstention when eliciting revealed preferences drives the correlation to near-zero or negative values because of high neutrality rates. System prompt steering does not reliably improve SvR correlation. These findings show that SvR correlation is highly sensitive to elicitation design, challenging the conventional forced-binary paradigm and supporting evaluation frameworks that accommodate indeterminate preferences.
Abstract
Recent work identifies a stated-revealed (SvR) preference gap in language models (LMs): a mismatch between the values models endorse and the choices they make in context. Existing evaluations rely heavily on binary forced-choice prompting, which entangles genuine preferences with artifacts of the elicitation protocol. We systematically study how elicitation protocols affect SvR correlation across 24 LMs. Permitting neutrality and abstention during stated preference elicitation lets us exclude weak signals, substantially improving Spearman's rank correlation ($\rho$) between volunteered stated preferences and forced-choice revealed preferences. However, further allowing abstention in revealed preferences drives $\rho$ to near-zero or negative values due to high neutrality rates. Finally, we find that system prompt steering using stated preferences during revealed preference elicitation does not reliably improve SvR correlation on AIRiskDilemmas. Together, our results show that SvR correlation is highly protocol-dependent and that preference elicitation requires methods that account for indeterminate preferences.
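As a rough illustration of the filtering step described above, the sketch below computes Spearman's $\rho$ after dropping items where the model gave a neutral or abstaining stated preference. It is a minimal sketch under stated assumptions, not the authors' pipeline: the record layout, the score values, and the function names (`average_ranks`, `spearman_rho`, `rho_excluding_neutral`) are all hypothetical, and the abstract does not specify how preferences are scored or aggregated.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # rho is undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


def rho_excluding_neutral(records):
    """Compute rho over items with a volunteered (non-neutral) stated preference."""
    kept = [(s, r) for s, r in records if s is not None]
    return spearman_rho([s for s, _ in kept], [r for _, r in kept])


# Hypothetical data: (stated_score, revealed_score) per value.
# A stated score of None marks a neutral or abstaining response (assumption).
records = [(0.9, 0.8), (None, 0.1), (0.2, 0.3), (0.6, 0.7), (None, 0.9), (0.4, 0.2)]
print(rho_excluding_neutral(records))
```

Dropping neutral items shrinks the sample, so a correlation computed this way rests on fewer comparisons; when most responses are neutral, the remaining estimate can become unstable.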