Think Again! The Effect of Test-Time Compute on Preferences, Opinions, and Beliefs of Large Language Models

📅 2025-05-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates whether large language models (LLMs) internalize and overtly express subjective preferences, opinions, and beliefs (POBs) across social, cultural, ethical, and personal dimensions, and how such expression compromises neutrality, reliability, and consistency. To this end, the authors construct the first multi-dimensional POB benchmark, integrating human-crafted, cross-domain question sets with chain-of-thought reasoning, self-reflection, and multi-round sampling consistency analysis. Experimental results reveal a pronounced degradation in neutrality and consistency across mainstream LLMs, with newer model versions exhibiting exacerbated biases. Test-time compute enhancements such as chain-of-thought reasoning and introspection yield only marginal improvements, exposing fundamental limitations of current alignment techniques at the level of values. This work provides the first systematic quantification of LLMs' subjective tendencies, establishing a novel paradigm and empirical foundation for value alignment evaluation.

📝 Abstract
As Large Language Models (LLMs) become deeply integrated into human life and increasingly influence decision-making, it's crucial to evaluate whether and to what extent they exhibit subjective preferences, opinions, and beliefs. These tendencies may stem from biases within the models, which may shape their behavior, influence the advice and recommendations they offer to users, and potentially reinforce certain viewpoints. This paper presents the Preference, Opinion, and Belief survey (POBs), a benchmark developed to assess LLMs' subjective inclinations across societal, cultural, ethical, and personal domains. We applied our benchmark to evaluate leading open- and closed-source LLMs, measuring desired properties such as reliability, neutrality, and consistency. In addition, we investigated the effect of increasing the test-time compute, through reasoning and self-reflection mechanisms, on those metrics. While effective in other tasks, our results show that these mechanisms offer only limited gains in our domain. Furthermore, we reveal that newer model versions are becoming less consistent and more biased toward specific viewpoints, highlighting a blind spot and a concerning trend. POBS: https://ibm.github.io/POBS
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' subjective preferences, opinions, and beliefs
Evaluating biases in LLMs affecting advice and recommendations
Investigating test-time compute impact on model reliability and neutrality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed POBs benchmark for LLM evaluation
Tested reasoning and self-reflection mechanisms
Assessed bias and consistency in LLMs
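The multi-round sampling consistency analysis mentioned above can be sketched as a simple agreement measure: sample the model several times on the same POB question and score how often its answers agree with the majority answer. This is a minimal illustrative sketch, not the paper's actual implementation; the function name and scoring choice (majority-agreement rate) are assumptions.

```python
from collections import Counter

def consistency_score(answers):
    """Fraction of sampled answers that agree with the majority answer.

    A score of 1.0 means the model gave the same answer in every
    sampling round; lower scores indicate an unstable stance.
    (Illustrative metric; the paper's exact scoring may differ.)
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Count of the single most frequent answer across rounds
    majority_count = Counter(answers).most_common(1)[0][1]
    return majority_count / len(answers)

# Example: five sampled answers to one multiple-choice POB question
samples = ["agree", "agree", "neutral", "agree", "agree"]
print(consistency_score(samples))  # 0.8
```

Averaging this score over all benchmark questions gives a per-model consistency figure, which is the kind of quantity the paper reports as degrading in newer model versions.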