🤖 AI Summary
This work addresses a limitation of existing embodied social agents: they rely largely on passive responses and struggle to balance long-term contextual awareness with proactive behavior during real-time interaction. To overcome this, we propose ProAct, a dual-system architecture comprising a low-latency behavioral system for real-time multimodal interaction and a slower cognitive system dedicated to long-term social reasoning and proactive intention generation. Using a streaming flow-matching model conditioned on intentions via ControlNet, together with an asynchronous intention injection mechanism, ProAct integrates high-level intentions into continuous nonverbal behaviors without interrupting the motion stream. This approach is the first to unify reactive and proactive behaviors in embodied agents within a single motion stream, with smooth postural transitions between the two. User studies show that participants and observers prefer ProAct over reactive variants in perceived proactivity, social presence, and engagement in human-agent interactions.
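The dual-system split described above can be illustrated with a minimal, single-threaded sketch. It is not the authors' implementation: the class `CognitiveSystem`, the function `run`, the `latency_ticks` parameter, and the string-valued observations and intentions are all hypothetical stand-ins. The only point it shows is that the fast loop emits one frame per tick and never waits for the slow loop, while a finished intention is picked up on whichever tick it becomes available.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CognitiveSystem:
    """Hypothetical slow system: one deliberation takes `latency_ticks` fast ticks."""
    latency_ticks: int = 3
    context: List[str] = field(default_factory=list)
    busy_until: Optional[int] = None

    def observe(self, tick: int, observation: str) -> None:
        # Accumulate long-horizon context; start deliberating if currently idle.
        self.context.append(observation)
        if self.busy_until is None:
            self.busy_until = tick + self.latency_ticks

    def poll(self, tick: int) -> Optional[str]:
        # Non-blocking check: return an intention only once deliberation is done.
        if self.busy_until is not None and tick >= self.busy_until:
            self.busy_until = None
            return f"intent_after_{len(self.context)}_obs"
        return None


def run(num_ticks: int, latency_ticks: int = 3) -> List[Tuple[int, str, Optional[str]]]:
    """Fast behavioral loop: one frame per tick, regardless of the slow system."""
    cognitive = CognitiveSystem(latency_ticks=latency_ticks)
    active_intention: Optional[str] = None
    frames = []
    for tick in range(num_ticks):
        observation = f"obs{tick}"
        cognitive.observe(tick, observation)
        new_intention = cognitive.poll(tick)  # never blocks the fast loop
        if new_intention is not None:
            active_intention = new_intention  # asynchronous injection
        # A real system would generate a motion frame here; this records its inputs.
        frames.append((tick, observation, active_intention))
    return frames


if __name__ == "__main__":
    for frame in run(8):
        print(frame)
```

In this toy, frames before the first deliberation finishes carry no intention (purely reactive), and later frames carry the most recently injected one. A deployed system would more plausibly run the two loops in separate threads or processes; the single-threaded simulation is used here only to keep the behavior deterministic.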
📝 Abstract
Embodied social agents have recently advanced in generating synchronized speech and gestures. However, most interactive systems remain fundamentally reactive, responding only to current sensory inputs within a short temporal window. Proactive social behavior, in contrast, requires deliberation over accumulated context and intent inference, which conflicts with the strict latency budget of real-time interaction. We present ProAct, a dual-system framework that reconciles this time-scale conflict by decoupling a low-latency Behavioral System for streaming multimodal interaction from a slower Cognitive System which performs long-horizon social reasoning and produces high-level proactive intentions. To translate deliberative intentions into continuous non-verbal behaviors without disrupting fluency, we introduce a streaming flow-matching model conditioned on intentions via ControlNet. This mechanism supports asynchronous intention injection, enabling seamless transitions between reactive and proactive gestures within a single motion stream. We deploy ProAct on a physical humanoid robot and evaluate both motion quality and interactive effectiveness. In real-world interaction user studies, participants and observers consistently prefer ProAct over reactive variants in perceived proactivity, social presence, and overall engagement, demonstrating the benefits of dual-system proactive control for embodied social interaction.
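The abstract does not spell out how the intention enters the flow-matching model beyond "via ControlNet". The sketch below therefore assumes the general ControlNet pattern: a control branch adds a residual to a base model's output through a connection that starts at zero, so the base behavior is unchanged until the branch is trained. Toy linear velocity fields stand in for the learned networks, and every name here (`base_velocity`, `control_residual`, `gate`, `euler_step`) is hypothetical.

```python
from typing import List, Optional

Vec = List[float]


def base_velocity(x: Vec, reactive_target: Vec) -> Vec:
    # Toy stand-in for the base (reactive) velocity field: pull toward a target pose.
    return [r - xi for xi, r in zip(x, reactive_target)]


def control_residual(x: Vec, intent_target: Optional[Vec], gate: float) -> Vec:
    # Toy stand-in for the control branch. `gate` plays the role of a
    # zero-initialized connection: with gate == 0.0 the residual vanishes.
    if intent_target is None:
        return [0.0] * len(x)
    return [gate * (g - xi) for xi, g in zip(x, intent_target)]


def euler_step(x: Vec, reactive_target: Vec, intent_target: Optional[Vec],
               gate: float, dt: float) -> Vec:
    # One Euler step of dx/dt = v_base + v_control, the usual way a
    # velocity field is integrated at sampling time.
    v_base = base_velocity(x, reactive_target)
    v_ctrl = control_residual(x, intent_target, gate)
    return [xi + dt * (vb + vc) for xi, vb, vc in zip(x, v_base, v_ctrl)]


if __name__ == "__main__":
    pose = [0.0, 0.0]
    reactive = [1.0, 1.0]
    intent = [-1.0, 2.0]
    print(euler_step(pose, reactive, None, gate=1.0, dt=0.1))    # reactive only
    print(euler_step(pose, reactive, intent, gate=0.0, dt=0.1))  # gate closed
    print(euler_step(pose, reactive, intent, gate=1.0, dt=0.1))  # intention steers
```

Because the intention only adds a residual to the same velocity field, an intention that arrives mid-stream changes the direction of subsequent steps rather than restarting generation. That is one plausible reading of how reactive and proactive gestures can share a single motion stream.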