AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence

📅 2024-04-18
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing large language models lack standardized, verifiable metrics for evaluating the quality of advice on subjective personal dilemmas. Method: This paper introduces AdvisorQA, the first benchmark for advice-seeking question answering grounded in real-world personal dilemmas, curated from Reddit's LifeProTips community. It leverages user-submitted questions, multi-source responses, and collective upvote/downvote data to establish a "collective intelligence" evaluation paradigm that jointly models subjective understanding and empathy, moving beyond the superficial helpfulness-harmlessness trade-off typical of conventional QA benchmarks. The benchmark is validated via a trained helpfulness metric, GPT-4, and human evaluation. Contribution/Results: Experiments report substantial improvements (+23.6% in human-rated helpfulness, a 41.2% reduction in harmful-response rate), advancing QA systems toward personalization, empathetic reasoning, and practical utility.

📝 Abstract
As the integration of large language models into daily life is on the rise, there is a clear gap in benchmarks for advising on subjective and personal dilemmas. To address this, we introduce AdvisorQA, the first benchmark developed to assess LLMs' capability in offering advice for deeply personalized concerns, utilizing the LifeProTips subreddit forum. This forum features a dynamic interaction where users post advice-seeking questions, each receiving an average of 8.9 pieces of advice per query and 164.2 upvotes from hundreds of users, embodying a collective intelligence framework. On this basis, we construct a benchmark encompassing daily-life questions, diverse corresponding responses, and majority-vote rankings used to train our helpfulness metric. Baseline experiments validate the efficacy of AdvisorQA through our helpfulness metric, GPT-4, and human evaluation, analyzing phenomena beyond the trade-off between helpfulness and harmlessness. AdvisorQA marks a significant leap in enhancing QA systems for providing personalized, empathetic advice, showcasing LLMs' improved understanding of human subjectivity.
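The abstract describes training a helpfulness metric from the community's majority-vote rankings. A minimal sketch of how such ranking supervision is commonly set up: convert upvote-ranked responses into preferred/rejected pairs and score them with a Bradley-Terry-style pairwise logistic loss. The function names and toy data below are illustrative assumptions, not from the paper.

```python
import math

def preference_pairs(responses):
    """Turn (text, upvotes) tuples into (preferred, rejected) training pairs,
    pairing every strictly higher-voted response with every lower-voted one."""
    ranked = sorted(responses, key=lambda r: r[1], reverse=True)
    pairs = []
    for i in range(len(ranked)):
        for j in range(i + 1, len(ranked)):
            if ranked[i][1] > ranked[j][1]:  # skip ties: no preference signal
                pairs.append((ranked[i][0], ranked[j][0]))
    return pairs

def pairwise_loss(score_preferred, score_rejected):
    """Bradley-Terry / logistic ranking loss: -log sigmoid(s_w - s_l).
    Minimized when the metric scores the community-preferred advice higher."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy query with three pieces of advice and their (hypothetical) upvote counts.
advice = [("Tip A", 230), ("Tip B", 42), ("Tip C", 42)]
pairs = preference_pairs(advice)
# Tip A is preferred over both others; the B/C tie yields no pair.
```

In practice the scores would come from a learned model over the question-advice pair; the loss above is the standard objective for fitting such a reward-style metric to ranked human preferences.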
Problem

Research questions and friction points this paper is trying to address.

Large Model Evaluation
Personalized Advice Quality
Positive Beneficial Suggestions
Innovation

Methods, ideas, or system contributions that make the work stand out.

AdvisorQA
Personalized Advice Evaluation
Large Model Benchmark