Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Modeling the dynamic multimodal coordination of speech, gestures, and facial expressions in face-to-face social interaction remains challenging for socially intelligent AI. Method: We introduce the first large-scale dyadic audio-visual interaction dataset (4,000+ hours) and propose a cross-modal sequential model integrating ASR, visual behavioral encoding, LLM-driven speech generation, and 2D/3D rendering to generate context-aware coordinated actions. A novel cross-modal alignment network enhances dyadic action prediction accuracy. Contribution/Results: Our framework enables fine-grained, emotion-state-, intensity-, and semantic-intent-conditioned controllable generation of gestures and facial expressions. Experiments demonstrate significant improvements in motion coherence and affective alignment of virtual agents. User studies confirm substantial gains in perceived naturalness and interaction quality, validating the efficacy of our approach for embodied social AI.
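The summary above describes a per-frame pipeline that fuses the interlocutor's speech and visual behavior into conditioned motion output. The following is a minimal, purely illustrative sketch of such a dyadic generation loop; the class and function names (`DyadicMotionModel`, `generate_motion`, the `emotion`/`intensity` conditioning arguments) are hypothetical placeholders, not the paper's API, and the "fusion" is a toy average standing in for the learned cross-modal alignment network.

```python
# Hypothetical sketch of a dyadic motion-generation loop, NOT the paper's code.
# All names and the fusion rule are illustrative assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    audio: List[float]   # interlocutor speech features (e.g. an audio/ASR embedding)
    visual: List[float]  # interlocutor behavior encoding (face/body features)


class DyadicMotionModel:
    """Toy stand-in: fuses the two input streams into one motion code per frame."""

    def __init__(self, emotion: str = "neutral", intensity: float = 1.0):
        # Conditioning signals corresponding to the controllable variants
        # described in the summary (emotion state, expressivity level).
        self.emotion = emotion
        self.intensity = intensity

    def step(self, frame: Frame) -> List[float]:
        # A real model would use a learned cross-modal alignment network;
        # here we just average the streams and scale by expressivity.
        fused = [(a + v) / 2.0 for a, v in zip(frame.audio, frame.visual)]
        return [x * self.intensity for x in fused]


def generate_motion(model: DyadicMotionModel, frames: List[Frame]) -> List[List[float]]:
    """One motion code per interlocutor frame, in sequence."""
    return [model.step(f) for f in frames]


frames = [Frame(audio=[0.5, 1.0], visual=[1.5, 0.0]) for _ in range(3)]
motion = generate_motion(DyadicMotionModel(intensity=2.0), frames)
print(motion[0])  # [2.0, 1.0]
```

In the real system the motion codes would drive the 2D/3D rendering stage; the point of the sketch is only the control flow: frame-wise fusion of both modalities under explicit emotion/intensity conditioning.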

📝 Abstract
Human communication involves a complex interplay of verbal and nonverbal signals, essential for conveying meaning and achieving interpersonal goals. To build socially intelligent AI technologies, it is crucial to develop models that can both comprehend and generate dyadic behavioral dynamics. To this end, we introduce the Seamless Interaction Dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage from over 4,000 participants in diverse contexts. This dataset enables the development of AI technologies that understand dyadic embodied dynamics, unlocking breakthroughs in virtual agents, telepresence experiences, and multimodal content analysis tools. We also develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech. These models can take as input both the speech and visual behavior of their interlocutors. We present a variant with speech from an LLM and integrations with 2D and 3D rendering methods, bringing us closer to interactive virtual agents. Additionally, we describe controllable variants of our motion models that can adapt emotional responses and expressivity levels, as well as generate more semantically relevant gestures. Finally, we discuss methods for assessing the quality of these dyadic motion models, demonstrating the potential for more intuitive and responsive human-AI interactions.
Problem

Research questions and friction points this paper is trying to address.

Model dyadic audiovisual behavior for human-AI interaction
Create large-scale dataset for embodied dynamics analysis
Generate context-aware gestures and expressions from speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dyadic interaction dataset creation
Multimodal motion and speech generation models
Controllable emotional and expressive motion variants
Authors

Vasu Agrawal (Facebook Reality Labs)
Akinniyi Akinyemi (Meta)
Kathryn Alvero (Meta)
Morteza Behrooz (Meta AI; AI + HCI, HRI, Narrative Generation)
Julia Buffalini (Meta)
Fabio Maria Carlucci (Meta; Machine Learning, computer vision, object recognition)
Joy Chen (Meta)
Junming Chen (Meta)
Zhang Chen (Meta)
Shiyang Cheng (Meta)
Praveen Chowdary (Meta)
Joe Chuang (Meta)
Antony D'Avirro (Meta)
Jon Daly (Meta)
Ning Dong (Meta)
Mark Duppenthaler (Meta)
Cynthia Gao (Meta)
Jeff Girard (University of Kansas)
Martin Gleize (Meta)
Sahir Gomez (Meta)
Hongyu Gong (Fundamental AI Research at Meta; Natural Language Processing)
Srivathsan Govindarajan (Meta)
Brandon Han (Meta)
Sen He (Meta)
Denise Hernandez (Meta)