Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing deep research tools struggle to capture users’ authentic research needs and lack effective personalization. This work proposes MyScholarQA, a system that models user profiles and uses large language models (LLMs) to generate personalized action recommendations and multi-section research reports, providing tailored support for in-depth scholarly inquiry. The study shows that relying solely on LLM-based automatic evaluation overlooks critical flaws in personalized settings. Under a dual evaluation framework combining a synthetic-user benchmark with real-user interviews, the system outperforms baselines in the synthetic assessment, yet qualitative feedback from actual users uncovers nine categories of personalization errors invisible to the LLM judges, yielding concrete lessons for future system design.

📝 Abstract
Deep Research (DR) tools (e.g., OpenAI DR) help researchers cope with ballooning publication counts. Such tools can synthesize scientific papers to answer researchers' queries, but lack understanding of their users. We change that in MyScholarQA (MySQA), a personalized DR tool that: 1) infers a profile of a user's research interests; 2) proposes personalized actions for a user's input query; and 3) writes a multi-section report for the query that follows user-approved actions. We first test MySQA with NLP's standard protocol: we design a benchmark of synthetic users and LLM judges, where MySQA beats baselines in citation metrics and personalized action-following. However, we suspect this process does not cover all aspects of personalized DR that users value, so we interview users of an online version of MySQA to unmask them. We reveal nine nuanced errors of personalized DR undetectable by our LLM judges, and we study qualitative feedback to form lessons for future DR design. In all, we argue for a pillar of personalization that easy-to-use LLM judges can lead NLP to overlook: real progress in personalization is only possible with real users.
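The three-stage pipeline the abstract describes (profile inference → personalized action proposal → user-approved report writing) can be sketched as a toy skeleton. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and a trivial keyword heuristic stands in for the LLM calls MySQA would actually make.

```python
from dataclasses import dataclass


@dataclass
class UserProfile:
    """Toy stand-in for MySQA's inferred research-interest profile."""
    interests: list


def infer_profile(paper_titles):
    # Stage 1 (assumed): infer interests from a user's papers.
    # A keyword heuristic replaces the LLM here for illustration.
    keywords = set()
    for title in paper_titles:
        for word in title.lower().split():
            if len(word) > 7:
                keywords.add(word)
    return UserProfile(interests=sorted(keywords))


def propose_actions(profile, query):
    # Stage 2 (assumed): propose personalized actions for the query,
    # which the user can then approve or reject.
    actions = [f"Survey recent work on '{query}'"]
    actions += [f"Relate '{query}' to the user's interest in '{i}'"
                for i in profile.interests]
    return actions


def write_report(query, approved_actions):
    # Stage 3 (assumed): write a multi-section report, one section
    # per user-approved action (section text would come from an LLM).
    sections = [f"## {a}\n(section text would be generated here)"
                for a in approved_actions]
    return f"# Report: {query}\n\n" + "\n\n".join(sections)


profile = infer_profile([
    "Personalization in dialogue systems",
    "Evaluating retrieval-augmented generation",
])
actions = propose_actions(profile, "personalized deep research")
report = write_report("personalized deep research", actions[:2])
```

The key design point the abstract emphasizes is the human-in-the-loop gate between stages 2 and 3: the report follows only the actions the user approved, which is exactly the behavior the synthetic LLM-judge benchmark scores and the real-user interviews probe.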
Problem

Research questions and friction points this paper is trying to address.

personalization
Deep Research
real users
language models
user evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

personalized deep research
user modeling
real user evaluation
LLM-based synthesis
action-aware reporting