GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing recommendation-system benchmarks, which predominantly emphasize item-prediction accuracy and fail to adequately evaluate large language models' (LLMs') capacity to infer users' genuine interests from interaction data. To bridge this gap, the authors introduce GISTBench, the first fine-grained benchmark tailored to assessing LLMs' user-interest understanding. GISTBench combines explicit and implicit interaction signals from short-video platforms with textual descriptions to construct a synthetic dataset, and proposes two novel metrics: Interest Groundedness (IG) and Interest Specificity (IS). Through user studies and multidimensional evaluation covering precision, recall, and specificity, the authors benchmark eight open-source LLMs spanning 7B to 120B parameters, revealing critical bottlenecks in accurately attributing and counting user interests across heterogeneous interactions and offering actionable insights for future model refinement.
📝 Abstract
We introduce GISTBench, a benchmark for evaluating Large Language Models' (LLMs') ability to understand users from their interaction histories in recommendation systems. Unlike traditional RecSys benchmarks that focus on item-prediction accuracy, our benchmark evaluates how well LLMs can extract and verify user interests from engagement data. We propose two novel metric families: Interest Groundedness (IG), decomposed into precision and recall components to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which assesses the distinctiveness of verified LLM-predicted user profiles. We release a synthetic dataset constructed from real user interactions on a global short-form video platform. The dataset contains both implicit and explicit engagement signals and rich textual descriptions. We validate our dataset's fidelity against user surveys, and evaluate eight open-weight LLMs spanning 7B to 120B parameters. Our findings reveal performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.
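The abstract describes IG as a precision/recall decomposition over interest categories, where precision penalizes hallucinated interests and recall rewards coverage. The paper's exact formulation is not given on this page; the sketch below illustrates one plausible set-based reading, with `interest_groundedness` and both argument names being hypothetical.

```python
# Hypothetical sketch of a set-based IG decomposition; the paper's actual
# metric definitions may differ (e.g. weighting by engagement signals).

def interest_groundedness(predicted: set[str], ground_truth: set[str]) -> dict[str, float]:
    """Precision penalizes hallucinated interest categories; recall rewards coverage."""
    if not predicted or not ground_truth:
        return {"precision": 0.0, "recall": 0.0}
    true_positives = len(predicted & ground_truth)
    return {
        # fraction of predicted interests that are grounded in the user's real interests
        "precision": true_positives / len(predicted),
        # fraction of the user's real interests that the LLM recovered
        "recall": true_positives / len(ground_truth),
    }

scores = interest_groundedness(
    predicted={"cooking", "travel", "crypto"},      # "crypto" is hallucinated
    ground_truth={"cooking", "travel", "fitness"},  # "fitness" is missed
)
print(scores)
```

Under this reading, a model that emits many vague interests would score high on recall but low on precision, which is exactly the asymmetry the IG decomposition is meant to expose.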
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
User Understanding
Interest Verification
Recommendation Systems
Engagement Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interest Groundedness
Interest Specificity
LLM User Understanding
Recommendation Systems
Synthetic Benchmark Dataset
Iordanis Fostiropoulos
University of Southern California
Machine Learning, Artificial Intelligence
Muhammad Rafay Azhar
Meta Recommendation Systems (MRS)
Abdalaziz Sawwan
Meta Recommendation Systems (MRS)
Boyu Fang
Meta Recommendation Systems (MRS)
Yuchen Liu
Meta Recommendation Systems (MRS)
Jiayi Liu
Meta Platforms
Data Science, Machine Learning, Physics, Cosmology
Hanchao Yu
AI at Meta
Multimodal Understanding, Computer Vision, Deep Learning, Medical Image Analysis
Qi Guo
Meta Recommendation Systems (MRS)
Jianyu Wang
Facebook Inc.
Machine Learning, Computer Vision, Image Processing, Perceptual Image Quality, Color Science
Fei Liu
Meta Recommendation Systems (MRS)
Xiangjun Fan
Meta Recommendation Systems (MRS)