🤖 AI Summary
This work addresses the lack of standardized benchmarks and methodological comparisons for query auto-completion (Chat-Ghosting) in conversational systems. We construct a multi-dataset evaluation framework to systematically compare non-neural approaches, including trie-based and n-gram models, against neural models such as T5 and Phi-2. Methodologically, we propose a novel entropy-based dynamic early-stopping strategy and conduct the first context-aware completion experiments on both human-human and human-bot dialogue data. Results show that traditional methods outperform neural models in accuracy and inference efficiency for *seen* prefixes, whereas large language models generalize better to *unseen* prefixes; moreover, explicit modeling of conversational context significantly improves completion quality. This study establishes the first open-source benchmark for Chat-Ghosting, provides a fully reproducible methodology, and delivers key design insights for practical deployment.
📝 Abstract
Ghosting, the ability to predict a user's intended text input for inline query auto-completion, is an invaluable feature for modern search engines and chat interfaces, greatly enhancing user experience. By suggesting completions to incomplete queries (or prefixes), ghosting aids users with slow typing speeds, disabilities, or limited language proficiency. Ghosting is a challenging problem that has become more important with the ubiquity of chat-based systems like ChatGPT and Copilot. Despite this growing prominence, Chat-Ghosting has received little attention from the NLP/ML research community, and there is a lack of standardized benchmarks and relative performance analysis of deep learning and non-deep-learning methods. We address this through an open and thorough study of the problem using four publicly available dialog datasets: two human-human (DailyDialog and DSTC7-Ubuntu) and two human-bot (Open Assistant and ShareGPT). We experiment with various existing query auto-completion methods (using tries), n-gram methods, and deep learning methods, with and without dialog context. We also propose a novel entropy-based dynamic early stopping strategy. Our analysis finds that statistical n-gram models and tries outperform deep learning based models in both model performance and inference efficiency for seen prefixes, whereas for unseen prefixes neural models like T5 and Phi-2 yield better results. Adding conversational context leads to significant improvements in ghosting quality, especially on Open Assistant and ShareGPT. We make our code and data publicly available.
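The abstract pairs trie-based completion for seen prefixes with an entropy-based criterion for deciding when to stop (or withhold) a suggestion. The paper's exact formulation is not given here, so the following is only an illustrative sketch under stated assumptions: a frequency-weighted character trie serves seen prefixes, and a Shannon-entropy check over the candidate-completion distribution suppresses the suggestion when the model is too uncertain. The class name `CompletionTrie`, the helper `suggest`, and the threshold value are all hypothetical, not the paper's implementation.

```python
import math
from collections import Counter


class CompletionTrie:
    """Character trie; each node tracks frequencies of full queries passing through it."""

    def __init__(self):
        self.children = {}
        self.completions = Counter()  # full query -> observed frequency

    def insert(self, query, freq=1):
        node = self
        node.completions[query] += freq
        for ch in query:
            node = node.children.setdefault(ch, CompletionTrie())
            node.completions[query] += freq

    def lookup(self, prefix):
        node = self
        for ch in prefix:
            if ch not in node.children:
                return None  # unseen prefix
            node = node.children[ch]
        return node


def suggest(trie, prefix, entropy_threshold=1.0):
    """Return the most frequent completion for `prefix`, or None when the
    prefix is unseen or the completion distribution is too uncertain."""
    node = trie.lookup(prefix)
    if node is None:
        return None  # unseen prefix: a neural model could serve as fallback
    total = sum(node.completions.values())
    # Shannon entropy (bits) of the candidate-completion distribution
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in node.completions.values())
    if entropy > entropy_threshold:
        return None  # too uncertain to ghost a suggestion
    return node.completions.most_common(1)[0][0]
```

With queries "how are you" (seen 3 times), "how are things" (once), and "hello" (twice), the prefix "how are " yields a low-entropy distribution and ghosts "how are you", while the single character "h" exceeds the threshold and produces no suggestion.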