🤖 AI Summary
Existing speech-driven full-body gesture generation methods rely on vector quantization and autoregressive modeling, leading to loss of motion details and limited realism and diversity. To address this, we propose a multimodal-aligned autoregressive generative framework operating directly in continuous latent space. First, we introduce the Motion-Text-Audio-aligned Variational Autoencoder (MTA-VAE), which jointly encodes semantic and rhythmic information across motion, text, and audio modalities. Second, we design the Quantization-Free Multimodal Masked Autoregressive Diffusion model (MMAG), integrating WavCaps audio-text embeddings, continuous diffusion modeling, hybrid-granularity audio-text fusion blocks, and a variational architecture to jointly enforce semantic consistency, temporal synchronization, and cross-modal coherence in an end-to-end manner. Our approach achieves state-of-the-art performance on two major benchmarks, with significant improvements in quantitative metrics and qualitatively superior realism, naturalness, and gesture diversity.
📝 Abstract
This work focuses on full-body co-speech gesture generation. Existing methods typically employ an autoregressive model over vector-quantized tokens for gesture generation, which results in information loss and compromises the realism of the generated gestures. To address this, inspired by the natural continuity of real-world human motion, we propose MAG, a novel multi-modal aligned framework for high-quality and diverse co-speech gesture synthesis that does not rely on discrete tokenization. Specifically, (1) we introduce a motion-text-audio-aligned variational autoencoder (MTA-VAE), which leverages pre-trained WavCaps text and audio embeddings to enhance both semantic and rhythmic alignment with motion, ultimately producing more realistic gestures. (2) Building on this, we propose a multimodal masked autoregressive model (MMAG) that enables autoregressive modeling over continuous motion embeddings via diffusion, without vector quantization. To further ensure multi-modal consistency, MMAG incorporates a hybrid-granularity audio-text fusion block, which serves as conditioning for the diffusion process. Extensive experiments on two benchmark datasets demonstrate that MAG achieves state-of-the-art performance both quantitatively and qualitatively, producing highly realistic and diverse co-speech gestures. The code will be released to facilitate future research.
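To make the core idea concrete, here is a minimal toy sketch of quantization-free masked autoregressive decoding over continuous latents: positions start masked and are filled in groups, with each group's continuous latent produced by an iterative denoising step conditioned on the already-generated positions and an audio-text feature. Everything here (`toy_denoiser`, the fixed-point denoising rule, the conditioning vector) is a hypothetical stand-in for illustration, not the paper's actual MMAG architecture.

```python
import numpy as np

def toy_denoiser(noise, visible, cond, steps=10):
    """Hypothetical stand-in for a diffusion head: iteratively pulls noisy
    latents toward a target built from the conditioning signal and the
    mean of already-generated (visible) latents."""
    target = cond + 0.1 * visible.mean(axis=0, keepdims=True)
    x = noise
    for _ in range(steps):
        x = x + 0.3 * (target - x)  # simple deterministic "denoising" update
    return x

def masked_ar_generate(seq_len, dim, cond, n_iters=4, seed=0):
    """Sketch of masked autoregressive generation in continuous space:
    all positions begin masked; each iteration unmasks a subset by
    sampling their latents from the denoiser, conditioned on the rest."""
    rng = rng_state = np.random.default_rng(seed)
    latents = np.zeros((seq_len, dim))
    known = np.zeros(seq_len, dtype=bool)
    order = rng.permutation(seq_len)            # random unmasking schedule
    for idx in np.array_split(order, n_iters):
        visible = latents[known] if known.any() else np.zeros((1, dim))
        noise = rng.standard_normal((len(idx), dim))
        latents[idx] = toy_denoiser(noise, visible, cond)
        known[idx] = True
    return latents

cond = np.full(16, 0.5)                         # stand-in audio-text feature
motion_latents = masked_ar_generate(seq_len=8, dim=16, cond=cond)
print(motion_latents.shape)  # (8, 16)
```

The point of the sketch is the contrast with token-based pipelines: no codebook lookup appears anywhere, so the generated latents remain continuous end to end.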