🤖 AI Summary
Existing empathetic dialogue systems struggle to meet users’ diverse social support needs. Grounded in social support theory, this work proposes a tool-augmented agent framework for personalized companionship that delivers tangible assistance by invoking user-centric tools simulating multimedia applications. To advance research in this direction, we introduce ComPASS-Bench, the first personalized social support benchmark for large language model (LLM)-based agents, built through multi-step automated synthesis and manual refinement. Tool-use records synthesized from the benchmark are then used to fine-tune Qwen3-8B, yielding ComPASS-Qwen. Experiments show that the evaluated LLMs issue valid tool calls with high success rates but still fall short in final response quality, that tool-augmented responses outperform purely conversational empathy, and that ComPASS-Qwen improves substantially over its base model, reaching performance comparable to that of several large-scale models.
📝 Abstract
Developing compassionate interactive systems requires agents not only to understand user emotions but also to provide diverse, substantive support. While recent works explore empathetic dialogue generation, they remain limited in response form and content and struggle to satisfy diverse needs across users and contexts. To address this, we explore empowering agents with external tools to execute diverse actions. Grounded in the psychological concept of "social support", this paradigm delivers substantive, human-like companionship. Specifically, we first design a dozen user-centric tools that simulate various multimedia applications and cover different types of social support behaviors in human-agent interaction scenarios. We then construct ComPASS-Bench, the first personalized social support benchmark for LLM-based agents, via multi-step automated synthesis and manual refinement. Based on ComPASS-Bench, we further synthesize tool-use records to fine-tune the Qwen3-8B model, yielding a task-specific ComPASS-Qwen. Comprehensive evaluations across two settings reveal that while the evaluated LLMs can generate valid tool-calling requests with high success rates, significant gaps remain in final response quality. Moreover, tool-augmented responses achieve better overall performance than responses that offer conversational empathy alone. Notably, our trained ComPASS-Qwen demonstrates substantial improvements over its base model, achieving performance comparable to that of several large-scale models. Our code and data are available at https://github.com/hzp3517/ComPASS.