🤖 AI Summary
Existing large language models (LLMs) decouple task solving from preference alignment, rendering them ineffective in cold-start scenarios—such as privacy-sensitive or real-time interactions—where no user history is available to infer individual preferences.
Method: We propose “personalized reasoning” as a new paradigm: LLMs must proactively ask clarifying questions to uncover latent preferences and dynamically adapt both their reasoning chains and final responses. To rigorously evaluate this capability, we introduce PREFDISCO, a psychology-driven benchmark that transforms static tasks into interactive evaluations grounded in psychological personas.
Contribution/Results: PREFDISCO systematically assesses 21 state-of-the-art models across 10 diverse tasks. Results reveal that 29.0% of naive personalization attempts underperform generic responses—demonstrating that personalized reasoning does not emerge spontaneously. This work uncovers a fundamental limitation of current LLMs in cold-start settings and establishes the first quantifiable, optimization-ready framework for advancing personalized reasoning in high-stakes domains such as education and healthcare.
📝 Abstract
Current large language model (LLM) development treats task-solving and preference alignment as separate challenges, optimizing first for objective correctness, then for alignment to aggregated human preferences. This paradigm fails in human-facing applications, where solving a problem correctly is insufficient if the response mismatches the user's needs. The challenge intensifies in just-in-time scenarios where no prior user interaction history exists due to cold-start conditions or privacy constraints. LLMs need to identify what they don't know about user preferences, strategically elicit preference values through questioning, and then adapt their reasoning processes and responses accordingly -- a complicated chain of cognitive processes that we term personalized reasoning. We introduce PREFDISCO, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically grounded personas with sparse preferences. Our framework creates scenarios where identical questions require different reasoning chains depending on user context, since the optimal explanation approach varies with individual expertise and preferences even as factual accuracy must be maintained. Evaluation of 21 frontier models across 10 tasks reveals that 29.0% of naive personalization attempts produce worse preference alignment than generic responses, yet generic responses also fail to serve individual user needs effectively. These findings suggest personalized reasoning requires dedicated development rather than emerging naturally. PREFDISCO establishes personalized reasoning as a measurable research frontier and reveals fundamental limitations in current LLMs' interactive capabilities, providing a foundation for developing systems that can adapt to individual users in education, healthcare, and technical domains where personalization is critical.
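The loop the abstract describes -- identify unknown preference dimensions, elicit their values through questioning, then adapt the response -- can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the dimension names (`expertise`, `detail_level`, `tone`), the persona format, and the adaptation rules are all hypothetical placeholders.

```python
# Toy sketch of the personalized-reasoning loop: ask only about
# preference dimensions the model does not yet know, then tailor a
# factually fixed answer to the elicited values. All dimension names
# and adaptation rules below are illustrative assumptions.

PREFERENCE_DIMENSIONS = ["expertise", "detail_level", "tone"]

def elicit_preferences(ask_user, known=None):
    """Query the user only for dimensions not already known (the model's
    'known unknowns'), returning a complete preference profile."""
    prefs = dict(known or {})
    for dim in PREFERENCE_DIMENSIONS:
        if dim not in prefs:
            prefs[dim] = ask_user(f"What is your preferred {dim}?")
    return prefs

def adapt_response(base_answer, prefs):
    """Adapt the surface form of an answer without changing its facts."""
    if prefs.get("expertise") == "novice":
        base_answer = "In simple terms: " + base_answer
    if prefs.get("detail_level") == "brief":
        base_answer = base_answer.split(". ")[0] + "."
    return base_answer

# Simulated cold-start interaction: the user's persona is hidden and
# only revealed through the model's clarifying questions.
persona = {"expertise": "novice", "detail_level": "brief", "tone": "formal"}
answer_question = lambda q: persona[q.split()[-1].rstrip("?")]

prefs = elicit_preferences(answer_question)
print(adapt_response("Gradient descent minimizes loss. It iterates over steps.", prefs))
```

The point of the sketch is the separation the paper argues current LLMs lack: eliciting preferences is an explicit step that happens before answer adaptation, rather than alignment being baked in from aggregated preferences.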