🤖 AI Summary
The absence of high-quality evaluation benchmarks for personalized long-form answer generation hinders progress in user-context-aware question answering. Method: This paper introduces LaMP-QA, an open-source benchmark designed for evaluating personalized long-form question answering, covering three major categories: Arts & Entertainment, Lifestyle & Personal Development, and Society & Culture, with over 45 subcategories in total. The authors conduct comprehensive human and automatic evaluations to compare multiple strategies for evaluating generated personalized responses and to measure their alignment with human preferences. Contribution/Results: Benchmarking non-personalized and personalized approaches built on open-source and proprietary LLMs shows that incorporating the provided user context improves performance by up to 39%. The benchmark is publicly released to support the development and rigorous assessment of personalized question answering systems.
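To make the personalized vs. non-personalized setup concrete, here is a minimal sketch of how user context might be folded into a prompt. The profile field names, prompt wording, and example data are hypothetical placeholders for illustration, not the benchmark's actual schema or the paper's prompting method.

```python
# Hypothetical sketch: build a prompt with or without a user's Q&A history.
# Field names ("question", "answer") and wording are illustrative assumptions.

def build_prompt(question: str, profile: list[dict] | None = None) -> str:
    """Assemble a prompt, optionally prepending the user's past Q&A history."""
    if not profile:
        # Non-personalized baseline: the question alone.
        return f"Answer the following question in detail:\n{question}"
    # Personalized variant: surface user context before the question.
    history = "\n".join(
        f"- Q: {item['question']}\n  A: {item['answer']}" for item in profile
    )
    return (
        "The user previously asked and received these answers:\n"
        f"{history}\n\n"
        "Taking the user's interests and preferences into account, "
        f"answer the following question in detail:\n{question}"
    )

# Toy usage with made-up profile data.
profile = [{"question": "Best jazz albums for beginners?",
            "answer": "Start with Kind of Blue, then A Love Supreme."}]
print(build_prompt("What live concerts should I attend this year?", profile))
```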
📝 Abstract
Personalization is essential for user-centric question answering systems. Despite its importance, personalization in answer generation has been relatively underexplored, mainly due to a lack of resources for training and evaluating personalized question answering systems. We address this gap by introducing LaMP-QA -- a benchmark designed for evaluating personalized long-form answer generation. The benchmark covers questions from three major categories: (1) Arts & Entertainment, (2) Lifestyle & Personal Development, and (3) Society & Culture, encompassing over 45 subcategories in total. To assess the quality and potential impact of LaMP-QA for personalized question answering, we conduct comprehensive human and automatic evaluations, comparing multiple strategies for evaluating generated personalized responses and measuring their alignment with human preferences. Furthermore, we benchmark a range of non-personalized and personalized approaches based on open-source and proprietary large language models (LLMs). Our results show that incorporating the provided personalized context leads to performance improvements of up to 39%. The benchmark is publicly released to support future research in this area.
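As an illustration of what "alignment with human preferences" can mean operationally, the sketch below rank-correlates automatic metric scores against human preference ratings. All numbers are made up, and Spearman correlation via SciPy is just one plausible choice, not necessarily the measure used in the paper.

```python
# Hypothetical sketch: quantify how well an automatic evaluation strategy
# agrees with human preference ratings for a set of generated answers.
from scipy.stats import spearmanr

# Made-up scores for five generated answers (illustrative only).
metric_scores = [0.62, 0.41, 0.77, 0.55, 0.30]   # automatic evaluator
human_ratings = [4, 2, 5, 3, 1]                  # human preference ratings

rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A higher rho means the automatic strategy orders answers more like humans do.
```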