MIBURI: Towards Expressive Interactive Gesture Synthesis

📅 2026-03-03
🤖 AI Summary
This work addresses the lack of natural, expressive, real-time body and facial gestures in current large language model (LLM)-driven conversational agents, which hinders human-like interaction. The authors propose the first online causal generation framework that encodes full-body motion into hierarchical discrete tokens via a body-part-aware motion codec and jointly generates synchronized gestures and facial expressions through 2D causal autoregressive modeling conditioned on LLM-derived speech and text embeddings. This approach enables, for the first time, causal, real-time co-generation of full-body gestures and facial expressions with low latency. Auxiliary learning objectives are introduced to enhance motion diversity and expressiveness while mitigating convergence to static poses. Experiments demonstrate significant improvements over recent baselines in motion naturalness and contextual alignment, establishing a new paradigm for highly expressive interactive virtual agents.
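
The summary describes a body-part-aware motion codec that compresses continuous motion into multi-level discrete tokens. The sketch below shows one common way such a codec can be realized, using per-body-part residual vector quantization; the module names, dimensions, and part split are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Quantize a latent into several levels of discrete tokens; each level
    encodes the residual left by the previous one (illustrative sketch)."""
    def __init__(self, dim=64, codebook_size=256, levels=3):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(levels)
        )

    def forward(self, z):                            # z: (batch, time, dim)
        residual, tokens = z, []
        for cb in self.codebooks:
            dists = torch.cdist(residual, cb.weight)  # (B, T, codebook_size)
            idx = dists.argmin(dim=-1)                # nearest codeword per frame
            residual = residual - cb(idx)             # pass residual to next level
            tokens.append(idx)
        return torch.stack(tokens, dim=-1)            # (B, T, levels)

class BodyPartCodec(nn.Module):
    """One small encoder plus residual quantizer per body part, so the token
    grid carries a part-level hierarchy alongside the temporal axis."""
    PARTS = {"upper_body": 0, "lower_body": 1, "hands": 2, "face": 3}

    def __init__(self, pose_dim=32, dim=64):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {p: nn.Linear(pose_dim, dim) for p in self.PARTS}
        )
        self.quantizers = nn.ModuleDict(
            {p: ResidualVQ(dim) for p in self.PARTS}
        )

    def forward(self, motion):           # motion: (B, T, parts, pose_dim)
        tokens = {}
        for part, i in self.PARTS.items():
            z = self.encoders[part](motion[:, :, i])
            tokens[part] = self.quantizers[part](z)   # (B, T, levels)
        return tokens

codec = BodyPartCodec()
fake_motion = torch.randn(2, 30, 4, 32)  # 2 clips, 30 frames, 4 body parts
print({k: v.shape for k, v in codec(fake_motion).items()})
```

Per-part codebooks keep coarse and fine motion detail in separate token streams, which is what lets a downstream model treat the part hierarchy as a second generation axis.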

📝 Abstract
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large language model (LLM)-based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing solutions for ECAs often produce rigid, low-diversity motions that are unsuitable for human-like interaction. Alternatively, generative methods for co-speech gesture synthesis yield natural body gestures but depend on future speech context and require long run-times. To bridge this gap, we present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. We employ body-part-aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens. These tokens are then autoregressively generated by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings, modeling both temporal dynamics and part-level motion hierarchy in real time. Further, we introduce auxiliary objectives to encourage expressive and diverse gestures while preventing convergence to static poses. Comparative evaluations demonstrate that our causal, real-time approach produces more natural and contextually aligned gestures than recent baselines. We encourage the reader to explore the demo videos at https://vcai.mpi-inf.mpg.de/projects/MIBURI/.
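
The abstract describes two-dimensional causal autoregressive generation: tokens are produced frame by frame along the temporal axis and, within each frame, part by part along the hierarchy axis, conditioned on streaming speech-text features. The sketch below illustrates that decoding order under stated assumptions; the decoder class, feature dimensions, and greedy sampling are hypothetical, and the paper's auxiliary anti-static objectives are not reproduced here.

```python
import torch
import torch.nn as nn

class CausalGestureDecoder(nn.Module):
    """Minimal 2D causal decoding sketch: an outer loop over streaming frames
    (temporal axis) and an inner loop over body parts (hierarchy axis)."""
    def __init__(self, vocab=256, dim=128, parts=4, cond_dim=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        self.part_emb = nn.Embedding(parts, dim)     # query slot for next part
        self.cond_proj = nn.Linear(cond_dim, dim)    # speech-text conditioning
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)
        self.parts = parts

    @torch.no_grad()
    def step(self, history, part_id, cond_t):
        """Predict one token from past tokens plus the current condition;
        only already-generated tokens are visible, so decoding is causal."""
        x = self.tok_emb(history) + self.cond_proj(cond_t).unsqueeze(1)
        x = torch.cat([x, self.part_emb(part_id).unsqueeze(1)], dim=1)
        h = self.backbone(x)
        return self.head(h[:, -1]).argmax(dim=-1)    # greedy for simplicity

decoder = CausalGestureDecoder()
history = torch.zeros(1, 1, dtype=torch.long)        # BOS token
for t in range(5):                                   # streaming frames
    cond_t = torch.randn(1, 128)                     # speech-text feature at t
    for p in range(decoder.parts):                   # part-level inner loop
        nxt = decoder.step(history, torch.tensor([p]), cond_t)
        history = torch.cat([history, nxt.unsqueeze(1)], dim=1)
print(history.shape)                                 # (1, 1 + 5*4)
```

Because each step conditions only on tokens already emitted and on the current-frame speech feature, no future speech context is needed, which is what makes low-latency online generation possible in principle.
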
Problem

Research questions and friction points this paper is trying to address.

Embodied Conversational Agents
co-speech gesture synthesis
real-time interaction
expressive gestures
full-body motion
Innovation

Methods, ideas, or system contributions that make the work stand out.

causal gesture synthesis
expressive full-body animation
real-time embodied agent
hierarchical motion tokens
LLM-conditioned generation