TEA-Bench: A Systematic Benchmarking of Tool-enhanced Emotional Support Dialogue Agent

📅 2026-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing empathetic dialogue systems and their evaluation benchmarks primarily focus on textual emotional expression, overlooking the role of external tools in grounding responses in factual evidence and mitigating hallucinations. This work proposes the first interactive evaluation framework for tool-augmented empathetic dialogue, constructing an MCP-style tool environment grounded in real-world emotional scenarios, introducing process-level evaluation metrics, and releasing the accompanying TEA-Dialog dataset. Experiments demonstrate that tool augmentation consistently improves empathetic support quality and reduces hallucination across nine large language models, though the extent of improvement is highly dependent on model capability. While supervised fine-tuning enhances in-domain performance, generalization remains limited. This study reveals a strong correlation between effective tool usage and model capacity, offering a new paradigm for building trustworthy empathetic dialogue systems.

📝 Abstract
Emotional Support Conversation requires not only affective expression but also grounded instrumental support to provide trustworthy guidance. However, existing ESC systems and benchmarks largely focus on affective support in text-only settings, overlooking how external tools can enable factual grounding and reduce hallucination in multi-turn emotional support. We introduce TEA-Bench, the first interactive benchmark for evaluating tool-augmented agents in ESC, featuring realistic emotional scenarios, an MCP-style tool environment, and process-level metrics that jointly assess the quality and factual grounding of emotional support. Experiments on nine LLMs show that tool augmentation generally improves emotional support quality and reduces hallucination, but the gains are strongly capacity-dependent: stronger models use tools more selectively and effectively, while weaker models benefit only marginally. We further release TEA-Dialog, a dataset of tool-enhanced ESC dialogues, and find that supervised fine-tuning improves in-distribution support but generalizes poorly. Our results underscore the importance of tool use in building reliable emotional support agents.
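The abstract's core idea is that an agent can ground an empathetic reply in a tool result instead of hallucinating advice. A minimal sketch of that pattern follows; it is not the paper's implementation, and the tool name, knowledge snippet, and keyword-based routing heuristic are all hypothetical stand-ins for the MCP-style environment and model-driven tool selection the benchmark actually uses.

```python
# Illustrative sketch (not from the paper): one tool-augmented emotional
# support turn. Tool names and the routing rule are hypothetical.

def search_coping_resources(topic: str) -> str:
    """Hypothetical MCP-style tool: returns a factual snippet for grounding."""
    knowledge = {
        "exam stress": ("Short breaks and spaced practice are commonly "
                        "recommended study strategies."),
    }
    return knowledge.get(topic, "No resource found.")

TOOLS = {"search_coping_resources": search_coping_resources}

def support_turn(user_msg: str) -> dict:
    """Produce a reply, calling a tool only when factual grounding helps.

    Returns the reply plus a tool-call trace, mirroring the kind of
    process-level record that turn-by-turn metrics could evaluate.
    """
    tool_calls = []
    if "exam" in user_msg.lower():
        # Instrumental support: ground the advice in retrieved evidence.
        evidence = TOOLS["search_coping_resources"]("exam stress")
        tool_calls.append("search_coping_resources")
        reply = ("That sounds really stressful. One thing that may help: "
                 + evidence)
    else:
        # Affective support alone; no factual claim, so no tool call.
        reply = "I'm sorry you're going through this. Tell me more?"
    return {"reply": reply, "tool_calls": tool_calls}
```

The trace in `tool_calls` is what makes process-level evaluation possible: a judge can check not just the reply text but whether a tool was invoked when (and only when) the turn required factual grounding.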
Problem

Research questions and friction points this paper is trying to address.

Emotional Support Conversation
tool augmentation
hallucination
factual grounding
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

tool-augmented dialogue
emotional support conversation
hallucination reduction
interactive benchmark
factual grounding