🤖 AI Summary
Large language models (LLMs) exhibit strong generalization capabilities but struggle to infer users’ implicit preferences, raising the fundamental question of whether conversational interaction can effectively uncover latent user needs.
Method: We introduce the first systematic, multi-task benchmark for preference inference—comprising 20 Questions, personalized question answering, and text summarization—spanning progressively more complex scenarios. Our evaluation framework employs a fine-grained, three-agent interaction protocol (user, assistant, judge) with multi-turn dialogue and context-aware per-turn assessment.
Contribution/Results: Our approach provides the first quantification of LLMs’ ability to elicit implicit user attributes across tasks, revealing substantial performance variation (32%–98%) and pronounced context sensitivity. The benchmark enables reproducible, modular analysis of preference discovery in personalized human-AI interaction, establishing an empirical foundation for future research on adaptive, user-centered dialogue systems.
📝 Abstract
Large Language Models (LLMs) excel at producing broadly relevant text, but this generality becomes a limitation when user-specific preferences are required, such as recommending restaurants or planning travel. In these scenarios, users rarely articulate every preference explicitly; instead, much of what they care about remains latent, waiting to be inferred. This raises a fundamental question: Can LLMs uncover and reason about such latent information through conversation?
We address this problem by introducing a unified benchmark for evaluating latent information discovery: the ability of LLMs to reveal and utilize hidden user attributes through multi-turn interaction. The benchmark spans three progressively realistic settings: the classic 20 Questions game, Personalized Question Answering, and Personalized Text Summarization. All tasks share a tri-agent framework (User, Assistant, Judge), enabling turn-level evaluation of elicitation and adaptation. Our results reveal that while LLMs can indeed surface latent information through dialogue, their success varies dramatically with context, from 32% to 98%, depending on task complexity, topic, and number of hidden attributes. This benchmark provides the first systematic framework for studying latent information discovery in personalized interaction, highlighting that effective preference inference remains an open frontier for building truly adaptive AI systems.
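The tri-agent loop described above can be sketched as follows. This is a minimal illustration with stub agents; the function names, the direct-question policy, and the coverage-based scoring rule are our assumptions, not the benchmark's actual implementation (which uses LLMs in each role).

```python
from dataclasses import dataclass

@dataclass
class Turn:
    question: str
    answer: str
    score: float

def run_episode(hidden_attrs, max_turns=3):
    """Simulate one dialogue: the Assistant probes for hidden user attributes,
    the User responds, and the Judge scores each turn by attribute coverage."""
    attrs = sorted(hidden_attrs)
    revealed = set()
    transcript = []
    for t in range(max_turns):
        # Assistant (stub policy): ask about the next attribute in order.
        target = attrs[t % len(attrs)]
        question = f"Do you care about {target}?"
        # User (stub simulator): reveals the attribute when asked directly.
        answer = f"Yes, {target} matters to me."
        revealed.add(target)
        # Judge (stub): per-turn score = fraction of hidden attributes uncovered.
        score = len(revealed) / len(attrs)
        transcript.append(Turn(question, answer, score))
    return transcript

episode = run_episode({"cuisine", "budget", "location"})
print([round(t.score, 2) for t in episode])  # → [0.33, 0.67, 1.0]
```

In the benchmark itself, each stub would be an LLM call: the User agent holds the hidden attribute profile, the Assistant must decide what to ask, and the Judge produces the context-aware per-turn assessment.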