Assessing Task-based Chatbots: Snapshot and Curated Datasets for Dialogflow

📅 2026-01-27
🤖 AI Summary
This work addresses the scarcity of large-scale, high-quality datasets for task-oriented dialogue systems by introducing TOFU-D, the first publicly released dataset of 1,788 real-world Dialogflow chatbots, together with COD, a rigorously curated, human-validated subset of 185 bots spanning multiple domains, languages, and implementation paradigms. Through systematic collection from GitHub, manual verification, and evaluation with the Botium testing framework and the Bandit static analyzer, the study reveals widespread deficiencies in test coverage and prevalent security vulnerabilities across most bots. These findings underscore the need for systematic quality assurance mechanisms in conversational AI and establish a robust empirical foundation for future research on cross-platform chatbot quality and security.

📝 Abstract
In recent years, chatbots have gained widespread adoption thanks to their ability to assist users at any time and across diverse domains. However, the lack of large-scale curated datasets limits research on their quality and reliability. This paper presents TOFU-D, a snapshot of 1,788 Dialogflow chatbots from GitHub, and COD, a curated subset of TOFU-D including 185 validated chatbots. The two datasets capture a wide range of domains, languages, and implementation patterns, offering a sound basis for empirical studies on chatbot quality and security. A preliminary assessment using the Botium testing framework and the Bandit static analyzer revealed gaps in test coverage and frequent security vulnerabilities in several chatbots, highlighting the need for systematic, multi-platform research on chatbot quality and security.
Problem

Research questions and friction points this paper is trying to address.

chatbot quality
curated datasets
security vulnerabilities
test coverage
task-based chatbots
Innovation

Methods, ideas, or system contributions that make the work stand out.

chatbot dataset
Dialogflow
curated benchmark
security analysis
empirical evaluation
Elena Masserini
University of Milano-Bicocca, Milan, Italy

Diego Clerissi
Software Engineering

D. Micucci
University of Milano-Bicocca, Milan, Italy

Leonardo Mariani
University of Milano-Bicocca
Software Testing · Software Analysis · Software Engineering · Computer Science