🤖 AI Summary
This work addresses a limitation of existing text auto-completion methods: they disregard visual context and thus struggle to accurately infer user intent in multimodal conversations. We introduce the first multimodal auto-completion task, which predicts subsequent characters by jointly leveraging partial input text and visual information, and present the first benchmark datasets for this purpose. To tackle this challenge, we propose the Router-Suggest framework, featuring a context-aware dynamic routing mechanism that switches between a text-only model and a vision-language model to balance performance and efficiency. A lightweight variant is also provided for resource-constrained settings. Experiments demonstrate that Router-Suggest achieves a 2.3–10× speedup over the best vision-language models while significantly improving user satisfaction, reducing typing effort, and enhancing completion quality in multi-turn dialogues.
📝 Abstract
Real-time multimodal auto-completion is essential for digital assistants, chatbots, design tools, and healthcare consultations, where user inputs rely on shared visual context. We introduce Multimodal Auto-Completion (MAC), a task that predicts upcoming characters in live chats using partially typed text and visual cues. Unlike traditional text-only auto-completion (TAC), MAC grounds predictions in multimodal context to better capture user intent. To enable this task, we adapt MMDialog and ImageChat to create benchmark datasets. We evaluate leading vision-language models (VLMs) against strong textual baselines, highlighting trade-offs between accuracy and efficiency. We present Router-Suggest, a router framework that dynamically selects between textual models and VLMs based on dialog context, along with a lightweight variant for resource-constrained environments. Router-Suggest achieves a 2.3–10× speedup over the best-performing VLM. A user study shows that VLMs significantly outperform textual models in user satisfaction, notably reducing typing effort and improving the quality of completions in multi-turn conversations. These findings underscore the need for multimodal context in auto-completion, paving the way for smarter, user-aware assistants.