🤖 AI Summary
This work addresses the scarcity of large-scale, high-quality datasets for task-oriented dialogue systems by introducing TOFU-D, the first publicly released dataset of 1,788 real-world Dialogflow chatbots. A rigorously curated, human-validated subset, COD (185 bots), spans multiple domains, languages, and implementation paradigms. Through systematic collection from GitHub, manual verification, and evaluation with the Botium testing framework and the Bandit static analyzer, the study reveals widespread gaps in test coverage and frequent security vulnerabilities among the analyzed bots. These findings underscore the need for systematic quality assurance in conversational AI and establish an empirical foundation for future research on cross-platform chatbot quality and security.
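To make the test-coverage observation concrete, the sketch below flags cloned bot repositories that ship no Botium conversation files (Botium's `*.convo.txt` convention). The `bots/` directory layout is a hypothetical assumption for illustration; this is a minimal sketch of a coverage-presence check, not the study's actual methodology.

```python
from pathlib import Path

BOTS_DIR = Path("bots")  # hypothetical: one cloned chatbot repo per subdirectory

bots = sorted(p for p in BOTS_DIR.iterdir() if p.is_dir())
untested = []
for bot in bots:
    # Botium test conversations conventionally use the *.convo.txt extension.
    convos = list(bot.rglob("*.convo.txt"))
    print(f"{bot.name}: {len(convos)} Botium convo file(s)")
    if not convos:
        untested.append(bot.name)

print(f"{len(untested)}/{len(bots)} bots ship no Botium test conversations")
```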
📝 Abstract
In recent years, chatbots have gained widespread adoption thanks to their ability to assist users at any time and across diverse domains. However, the lack of large-scale curated datasets limits research on their quality and reliability. This paper presents TOFU-D, a snapshot of 1,788 Dialogflow chatbots collected from GitHub, and COD, a curated subset of TOFU-D comprising 185 validated chatbots. The two datasets capture a wide range of domains, languages, and implementation patterns, offering a sound basis for empirical studies on chatbot quality and security. A preliminary assessment using the Botium testing framework and the Bandit static analyzer revealed gaps in test coverage and frequent security vulnerabilities in several chatbots, highlighting the need for systematic, multi-platform research on chatbot quality and security.
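As a rough illustration of the kind of security scan mentioned above, the sketch below runs the Bandit CLI over each chatbot repository in a local directory and tallies findings per bot. The `bots/` layout is a hypothetical assumption, not the structure of TOFU-D or COD, and the script is a minimal sketch rather than the study's actual pipeline; it only assumes the `bandit` command is installed and on PATH.

```python
import json
import subprocess
from pathlib import Path

BOTS_DIR = Path("bots")  # hypothetical: one cloned chatbot repo per subdirectory

for bot in sorted(p for p in BOTS_DIR.iterdir() if p.is_dir()):
    # Run Bandit recursively over the repo; -f json emits a machine-readable report.
    proc = subprocess.run(
        ["bandit", "-r", str(bot), "-f", "json"],
        capture_output=True,
        text=True,
    )
    try:
        report = json.loads(proc.stdout)
    except json.JSONDecodeError:
        print(f"{bot.name}: no parseable Bandit output (no Python code?)")
        continue
    issues = report.get("results", [])
    high = sum(1 for i in issues if i.get("issue_severity") == "HIGH")
    print(f"{bot.name}: {len(issues)} finding(s), {high} high severity")
```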