🤖 AI Summary
Current conversational AI systems lack proactivity: they respond passively to turn-taking cues and fail to emulate humans’ natural, contextually appropriate engagement. To address this, we propose the “Inner Thoughts” framework, the first to integrate insights from cognitive psychology and linguistics into multi-party dialogue modeling, equipping agents with autonomous, continuous thought streams. Methodologically, our approach combines implicit mental-state modeling, real-time intent triggering, multimodal feedback-driven dynamic turn-taking policies, and an end-to-end trainable architecture. Human evaluations demonstrate significant improvements over state-of-the-art baselines: +28.6% in anthropomorphism, +23.1% in coherence, +31.4% in perceived intelligence, and +35.7% in turn appropriateness. Our core contribution is the formal definition and realization of a motivation-driven proactive dialogue paradigm.
📝 Abstract
One of the long-standing aspirations in conversational AI is to allow agents to autonomously take initiative in conversations, i.e., to be proactive. This is especially challenging in multi-party conversations. Prior NLP research has focused mainly on predicting the next speaker from context such as the preceding conversation. In this paper, we demonstrate the limitations of such methods and rethink what it means for AI to be proactive in multi-party, human-AI conversations. We propose that, just like humans, a proactive AI does not merely react to turn-taking cues; it formulates its own inner thoughts during a conversation and seeks the right moment to contribute. Drawing on a formative study with 24 participants and inspiration from linguistics and cognitive psychology, we introduce the Inner Thoughts framework. Our framework equips AI with a continuous, covert train of thought that runs in parallel to the overt communication process, which enables it to engage proactively by modeling its intrinsic motivation to express these thoughts. We instantiated this framework in two real-time systems: an AI playground web app and a chatbot. In a technical evaluation and user studies with human participants, our framework significantly surpasses existing baselines on aspects such as anthropomorphism, coherence, intelligence, and turn-taking appropriateness.