🤖 AI Summary
Existing empathetic dialogue corpora rely heavily on costly, labor-intensive crowdsourcing, creating a data bottleneck for fine-tuning LLMs at scale. To address this, the authors propose a fully LLM-driven framework for empathetic data synthesis, producing SYNTHEMPATHY, an open-source corpus of 105k empathetic responses grounded in realistic life scenarios. The pipeline combines multi-stage empathetic prompting with response quality filtering, eliminating human annotation entirely. A base Mistral 7B model fine-tuned on SYNTHEMPATHY exhibits an increase in its average empathy score, suggesting that synthesized data can serve as a scalable, low-cost foundation for empathetic dialogue modeling.
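The closed-loop synthesis described above can be sketched as a generate-then-filter pipeline. This is a minimal illustrative sketch, not the paper's actual implementation: `call_llm`, the prompts, and the quality heuristic are all placeholder assumptions standing in for a real model such as Mistral 7B and the paper's filtering stage.

```python
def call_llm(prompt: str) -> str:
    # Placeholder: a real pipeline would query an instruction-tuned LLM here.
    if "Respond empathetically" in prompt:
        return ("That sounds really discouraging; it's okay to feel upset "
                "after trying so hard.")
    return "I failed my driving test for the third time today."

def passes_quality_filter(response: str) -> bool:
    # Toy heuristic standing in for the response-quality filtering stage:
    # require a minimum length and at least one empathy marker.
    empathy_markers = ("sounds", "feel", "understand", "okay")
    return (len(response.split()) >= 8
            and any(m in response.lower() for m in empathy_markers))

def synthesize_pairs(n: int) -> list[dict]:
    """Generate scenario/response pairs, keeping only filtered responses."""
    corpus = []
    while len(corpus) < n:
        scenario = call_llm("Describe a realistic difficult life situation.")
        response = call_llm(f"Respond empathetically to: {scenario}")
        if passes_quality_filter(response):
            corpus.append({"scenario": scenario, "response": response})
    return corpus
```

In a full-scale run, the same loop would repeat with varied scenario prompts until the target corpus size (105k pairs in the paper) is reached, with rejected responses simply regenerated.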
📝 Abstract
Previous research has shown that humans are more receptive towards language models that exhibit empathetic behavior. While empathy is essential for developing helpful dialogue agents, very few large corpora containing empathetic dialogues are available for fine-tuning LLMs. The few existing corpora have largely relied on crowdsourcing to simulate empathetic conversations, a process that is expensive, time-consuming, and not scalable to larger datasets. We propose a data generation framework for developing SYNTHEMPATHY, a large corpus containing 105k LLM-generated empathetic responses to real-life situations. A base Mistral 7B model fine-tuned on our SYNTHEMPATHY corpus exhibits an increase in its average empathy score.