🤖 AI Summary
Current evaluation of task-oriented chatbots suffers from limited validity due to the absence of large-scale, high-quality, and reproducible benchmark datasets; existing approaches rely heavily on small-scale, manually crafted examples or on outdated proxy systems. Method: This paper introduces an evaluation paradigm grounded in real-world open-source ecosystems. We systematically construct two novel datasets: TOFU-R, a snapshot of the Rasa chatbot ecosystem on GitHub derived via automated crawling and framework-aware parsing; and BRASATO, a curated selection of high-quality chatbots, rigorously screened and annotated for dialogue complexity, functional complexity, and practical utility. Contribution/Results: Both datasets, along with an integrated open-source evaluation toolkit, are publicly released; they improve the reproducibility, ecological relevance, and practical applicability of chatbot evaluation research.
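As a rough illustration of what automated GitHub crawling and framework-aware parsing could look like for a Rasa snapshot such as TOFU-R, the sketch below searches GitHub for candidate repositories and extracts basic features from a Rasa domain.yml. This is not the authors' pipeline: the search query, the repository heuristics, and the choice of domain fields are assumptions made for illustration only.

```python
"""Minimal sketch (not the paper's implementation) of a Rasa-aware GitHub crawl:
find candidate repositories, then parse domain.yml to characterize each bot."""
import requests
import yaml

GITHUB_SEARCH_API = "https://api.github.com/search/repositories"


def find_candidate_rasa_repos(token: str, max_pages: int = 2) -> list[dict]:
    """Search repositories that mention Rasa; a real crawler would additionally
    confirm the project is a Rasa bot (e.g., presence of domain.yml/config.yml)."""
    headers = {
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github+json",
    }
    repos = []
    for page in range(1, max_pages + 1):
        resp = requests.get(
            GITHUB_SEARCH_API,
            params={"q": "rasa chatbot in:readme,description",
                    "per_page": 50, "page": page},
            headers=headers,
            timeout=30,
        )
        resp.raise_for_status()
        repos.extend(resp.json().get("items", []))
    return repos


def parse_domain_file(domain_yaml_text: str) -> dict:
    """Framework-aware parsing: count intents, entities, responses, and forms
    declared in a Rasa domain.yml as a coarse measure of bot functionality."""
    domain = yaml.safe_load(domain_yaml_text) or {}
    return {
        "intents": len(domain.get("intents", [])),
        "entities": len(domain.get("entities", [])),
        "responses": len(domain.get("responses", {})),
        "forms": len(domain.get("forms", {})),
    }
```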
📝 Abstract
Task-based chatbots are increasingly being used to deliver real services, yet assessing their reliability, security, and robustness remains underexplored, in part due to the lack of large-scale, high-quality datasets. Emerging automated quality-assessment techniques targeting chatbots often rely on limited pools of subjects, such as custom-made toy examples or agents that are outdated, no longer available, or scarcely popular, which complicates the evaluation of these techniques. In this paper, we present two datasets and the tool support necessary to create and maintain them. The first dataset is RASA TASK-BASED CHATBOTS FROM GITHUB (TOFU-R), a snapshot of the Rasa chatbots available on GitHub that represents the state of the practice in open-source chatbot development with Rasa. The second dataset is BOT RASA COLLECTION (BRASATO), a curated selection of the chatbots most relevant in terms of dialogue complexity, functional complexity, and utility, whose goal is to ease reproducibility and facilitate research on chatbot reliability.
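The abstract does not spell out how chatbots are scored along dialogue complexity, functional complexity, and utility for the BRASATO selection, so the following is only a hypothetical sketch of such a curation step: it computes naive proxies from standard Rasa project files (domain.yml and stories.yml) and ranks candidates by a combined score. The metric names and the scoring formula are invented for illustration.

```python
"""Illustrative curation sketch: hypothetical complexity proxies for ranking
Rasa chatbots; the paper's actual selection criteria may differ."""
from dataclasses import dataclass
import yaml


@dataclass
class BotProfile:
    name: str
    n_intents: int          # proxy for functional complexity
    n_stories: int          # proxy for dialogue complexity
    avg_story_steps: float  # proxy for dialogue depth


def profile_bot(name: str, domain_text: str, stories_text: str) -> BotProfile:
    """Compute simple complexity proxies from Rasa domain.yml and stories.yml."""
    domain = yaml.safe_load(domain_text) or {}
    stories = (yaml.safe_load(stories_text) or {}).get("stories", [])
    step_counts = [len(s.get("steps", [])) for s in stories]
    avg_steps = sum(step_counts) / len(step_counts) if step_counts else 0.0
    return BotProfile(
        name=name,
        n_intents=len(domain.get("intents", [])),
        n_stories=len(stories),
        avg_story_steps=avg_steps,
    )


def select_top_bots(profiles: list[BotProfile], k: int = 10) -> list[BotProfile]:
    """Rank bots by a naive combined score and keep the top k candidates."""
    return sorted(
        profiles,
        key=lambda p: p.n_stories * p.avg_story_steps + p.n_intents,
        reverse=True,
    )[:k]
```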