🤖 AI Summary
This study investigates how immersion level (VR vs. 2D) modulates user perception of multimodal emotional expressions (speech plus gesture) conveyed by virtual humans, with particular focus on differences in empathy, emotional resonance, and presence between authentic and AI-synthesized signals. A controlled experiment compared the perceptual outcomes of authentic versus synthesized speech and gesture, both individually and in combination, across VR and 2D environments. Results show that while VR enhances the perceived alignment of natural multimodal behavior, it markedly amplifies the perceptual incongruence between synthesized speech and gesture, leading to significantly lower emotional credibility and empathic response relative to 2D. These findings reveal a critical gap: current multimodal generative models lack immersion-aware co-optimization mechanisms for cross-modal synchronization. This work provides the first empirical evidence that "modality mismatch" is perceptually exacerbated by immersion, a phenomenon the authors term *immersion-amplified modality misalignment*. It further establishes a new design imperative for VR-oriented speech–gesture co-generation frameworks grounded in perceptual coherence.
📝 Abstract
The creation of virtual humans increasingly leverages automated synthesis of speech and gestures, enabling expressive, adaptable agents that effectively engage users. However, the independent development of voice and gesture generation technologies, alongside the growing popularity of virtual reality (VR), raises significant questions about the integration of these signals and their ability to convey emotional nuance in immersive environments. In this paper, we evaluate the influence of real and synthetic gestures and speech, alongside varying levels of immersion (VR vs. 2D displays) and emotional contexts (positive, neutral, negative), on user perceptions. We investigate how immersion affects the perceived match between gestures and speech, as well as its impact on key aspects of user experience, including emotional and empathetic responses and the sense of co-presence. Our findings indicate that while VR enhances the perception of natural gesture-voice pairings, it does not similarly improve synthetic ones, thereby amplifying the perceptual gap between them. These results highlight the need to reassess gesture appropriateness and refine AI-driven synthesis for immersive environments. See video: https://youtu.be/WMfjIB1X-dc