A Dataset and Benchmark for Consumer Healthcare Question Summarization

📅 2025-12-29

🤖 AI Summary

Consumer health queries (CHQs) are often verbose and redundant, which introduces semantic noise and hinders medical natural language understanding (NLU) tasks; the setting also lacks summarization benchmarks annotated by domain experts. To address this, we introduce CHQ-Sum, the first dataset annotated by clinical and informatics experts (1,507 health questions sourced from social media) and explicitly designed for extracting salient medical information. Our methodology involves collecting real-world community Q&A data, establishing rigorous annotation guidelines, and systematically evaluating state-of-the-art models (BERT, PEGASUS, BART). Experimental results reveal low ROUGE-L scores (mean: 32.1), underscoring the need for domain adaptation. CHQ-Sum is publicly released, filling a key data gap in medical question summarization and providing a new benchmark for community health content understanding and downstream healthcare AI applications.
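
The summary above reports ROUGE-L, which scores a generated summary by the longest common subsequence (LCS) of tokens it shares with the reference summary. The following is a minimal sketch of a sentence-level ROUGE-L F1, assuming lowercased whitespace tokenization and equal weighting of precision and recall; the paper's exact ROUGE configuration is not stated here, and the function names and the example question are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L F1 (assumption: whitespace tokens, lowercased)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical reference/candidate pair, not taken from CHQ-Sum.
    reference = "what are the treatments for migraine"
    candidate = "treatments for migraine"
    print(round(rouge_l_f1(reference, candidate), 4))
```

Published results are typically computed with an established ROUGE package that also handles stemming and multi-sentence summaries, so scores from this sketch are not directly comparable to the reported mean.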

📝 Abstract

The quest for seeking health information has swamped the web with consumers' health-related questions. Generally, consumers use overly descriptive and peripheral information to express their medical condition or other healthcare needs, contributing to the challenges of natural language understanding. One way to address this challenge is to summarize the questions and distill the key information of the original question. Recently, large-scale datasets have significantly propelled the development of several summarization tasks, such as multi-document summarization and dialogue summarization. However, the lack of a domain-expert annotated dataset for the consumer healthcare question summarization task inhibits the development of an efficient summarization system. To address this issue, we introduce a new dataset, CHQ-Sum, that contains 1507 domain-expert annotated consumer health questions and corresponding summaries. The dataset is derived from a community question answering forum and therefore provides a valuable resource for understanding consumer health-related posts on social media. We benchmark the dataset on multiple state-of-the-art summarization models to show the effectiveness of the dataset.

Problem

Research questions and friction points this paper is trying to address.

Summarizes consumer healthcare questions to extract key information

Addresses lack of domain-expert annotated dataset for healthcare summarization

Benchmarks models on new dataset to improve healthcare question understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CHQ-Sum dataset with expert annotations

Derives data from community question answering forums

Benchmarks dataset using state-of-the-art summarization models
