Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models struggle to model nonverbal cues—such as gestures, facial expressions, and body language—hindering the development of immersive conversational AI. To address this, we introduce VENUS, the first large-scale multimodal dataset for video-driven dialogue, featuring time-aligned dialogue transcripts, fine-grained facial action units, and full-body pose annotations. Methodologically, we propose a novel nonverbal vector quantization (VQ) representation built upon a VQ-VAE, and design MARS: a text-video-action tri-modal joint modeling framework trained end-to-end via multimodal next-token prediction. Experimental results demonstrate that MARS significantly outperforms baselines in nonverbal cue generation and contextual consistency. Quantitative and qualitative analyses confirm VENUS’s broad coverage and high-quality annotations. Collectively, this work establishes a foundational resource and methodology for nonverbal understanding and generation in conversational AI.
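The summary describes a nonverbal vector quantization built on a VQ-VAE: continuous per-frame nonverbal features are mapped to discrete codebook ids that a language model can then predict as tokens. A minimal sketch of that quantization step is below; the codebook size, feature dimension, and random inputs are illustrative placeholders, not values from the paper.

```python
import numpy as np

# Illustrative sketch of the quantization step at the core of a VQ-VAE:
# continuous nonverbal features (e.g., per-frame facial action units or
# pose vectors) are snapped to the nearest entry of a learned codebook,
# yielding discrete token ids. Sizes here are arbitrary placeholders.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))      # 512 codes, 64-dim each (assumed)

def quantize(features: np.ndarray) -> np.ndarray:
    """Map each row of `features` (T, 64) to its nearest codebook id."""
    # Squared Euclidean distance from every frame to every codebook entry.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)               # (T,) discrete token ids

frames = rng.normal(size=(10, 64))         # 10 frames of nonverbal features
ids = quantize(frames)                     # one discrete id per frame
```

In a trained VQ-VAE the codebook is learned jointly with an encoder and decoder; here it is random purely to show the nearest-neighbor lookup that produces the discrete nonverbal tokens.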

📝 Abstract
Nonverbal communication is integral to human interaction, with gestures, facial expressions, and body language conveying critical aspects of intent and emotion. However, existing large language models (LLMs) fail to effectively incorporate these nonverbal elements, limiting their capacity to create fully immersive conversational experiences. We introduce MARS, a multimodal language model designed to understand and generate nonverbal cues alongside text, bridging this gap in conversational AI. Our key innovation is VENUS, a large-scale dataset comprising annotated videos with time-aligned text, facial expressions, and body language. Leveraging VENUS, we train MARS with a next-token prediction objective, combining text with vector-quantized nonverbal representations to achieve multimodal understanding and generation within a unified framework. Through various analyses of the VENUS dataset, we validate its substantial scale and effectiveness. Our quantitative and qualitative results demonstrate that MARS successfully generates text and nonverbal cues corresponding to conversational input.
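The abstract states that MARS is trained with a next-token prediction objective over text combined with vector-quantized nonverbal representations. One common way to realize this is to map the nonverbal codebook ids into a token range disjoint from the text vocabulary and model the merged stream with a decoder-only LM; the offset scheme and vocabulary size below are assumptions for illustration, not details from the paper.

```python
# Hypothetical sketch of interleaving text tokens with quantized
# nonverbal tokens into one sequence for next-token prediction.
# The vocabulary size and offset scheme are assumptions, not details
# taken from the paper.
TEXT_VOCAB = 50_000          # assumed text vocabulary size
NV_OFFSET = TEXT_VOCAB       # nonverbal ids mapped past the text range

def interleave(text_ids, nv_ids):
    """Merge a text segment and its aligned nonverbal code ids into a
    single token stream a decoder-only LM can model left-to-right."""
    return list(text_ids) + [NV_OFFSET + i for i in nv_ids]

seq = interleave([101, 7592, 102], [3, 17])
# Nonverbal ids land in their own range, disjoint from text ids.
```

Keeping the two modalities in disjoint id ranges lets a single softmax head generate either modality, which is what makes unified understanding and generation under one next-token objective possible.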
Problem

Research questions and friction points this paper is trying to address.

LLMs lack nonverbal cue integration in conversations
Need multimodal models for text and nonverbal communication
Absence of large-scale annotated video datasets for training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal model MARS integrates nonverbal cues
VENUS dataset combines text and video annotations
Unified framework for multimodal understanding and generation
Youngmin Kim, Yonsei University
Jiwan Chung, Yonsei University
Jisoo Kim, Yonsei University
Sunghyun Lee, Yonsei University
Sangkyu Lee, Yonsei University
Junhyeok Kim, Yonsei University
Cheoljong Yang, NC Research, NCSOFT Corporation
Youngjae Yu, Yonsei University