TRiMM: Transformer-Based Rich Motion Matching for Real-Time Multi-Modal Interaction in Digital Humans

📅 2025-06-01
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address the challenges of long-text comprehension and high latency in speech–gesture co-synthesis for LLM-driven digital humans in real-time interaction, this paper proposes an end-to-end real-time speech-to-3D-gesture generation framework. The method introduces three key innovations: (1) a cross-modal temporal alignment attention mechanism enabling fine-grained, dynamic alignment between acoustic features and gesture motions; (2) a sliding-window autoregressive modeling scheme for long-context understanding, supporting text inputs of up to hundreds of words; and (3) a lightweight retrieval-matching module driven by an atomic motion library, ensuring millisecond-level response. Built on a Transformer architecture, the system achieves 120 fps rendering on an RTX 3060 GPU with a sentence-level end-to-end latency of only 0.15 seconds. Comprehensive evaluations on the ZEGGS and BEAT benchmarks show that the system surpasses prior state-of-the-art methods, marking the first demonstration of high-fidelity, low-latency, deployable multimodal real-time interaction.
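As a concrete illustration of module (1), the following is a minimal sketch of what such a cross-modal temporal alignment attention layer could look like, assuming PyTorch, with acoustic frames as queries attending over gesture frames. All module names, dimensions, and the overall layout are illustrative assumptions, not the TRiMM implementation.

```python
# Hypothetical sketch of cross-modal temporal alignment attention:
# audio frames attend over gesture frames, producing a soft alignment
# map between the two modalities. Names/dims are illustrative only.
import torch
import torch.nn as nn

class CrossModalAlignment(nn.Module):
    def __init__(self, audio_dim=128, motion_dim=256, d_model=256, n_heads=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)    # acoustic features -> shared space
        self.motion_proj = nn.Linear(motion_dim, d_model)  # gesture features -> shared space
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio_feats, motion_feats):
        # audio_feats:  (batch, T_audio, audio_dim)
        # motion_feats: (batch, T_motion, motion_dim)
        q = self.audio_proj(audio_feats)
        kv = self.motion_proj(motion_feats)
        # Each audio frame attends over candidate gesture frames,
        # yielding a fine-grained temporal alignment across modalities.
        aligned, attn_weights = self.attn(q, kv, kv, need_weights=True)
        return self.norm(aligned + q), attn_weights

# Usage: align 2 s of 50 Hz acoustic features with 120 gesture frames.
model = CrossModalAlignment()
audio = torch.randn(1, 100, 128)
motion = torch.randn(1, 120, 256)
out, weights = model(audio, motion)  # weights: (1, 100, 120) alignment map
```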

📝 Abstract
Large Language Model (LLM)-driven digital humans have sparked a series of recent studies on co-speech gesture generation systems. However, existing approaches struggle with real-time synthesis and long-text comprehension. This paper introduces Transformer-Based Rich Motion Matching (TRiMM), a novel multi-modal framework for real-time 3D gesture generation. Our method incorporates three modules: 1) a cross-modal attention mechanism to achieve precise temporal alignment between speech and gestures; 2) a long-context autoregressive model with a sliding window mechanism for effective sequence modeling; 3) a large-scale gesture matching system that constructs an atomic action library and enables real-time retrieval. Additionally, we develop a lightweight pipeline implemented in the Unreal Engine for experimentation. Our approach achieves real-time inference at 120 fps and maintains a per-sentence latency of 0.15 seconds on consumer-grade GPUs (GeForce RTX 3060). Extensive subjective and objective evaluations on the ZEGGS and BEAT datasets demonstrate that our model outperforms current state-of-the-art methods. TRiMM enhances the speed of co-speech gesture generation while ensuring gesture quality, enabling LLM-driven digital humans to respond to speech in real time and synthesize corresponding gestures. Our code is available at https://github.com/teroon/TRiMM-Transformer-Based-Rich-Motion-Matching
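The sliding-window mechanism mentioned in the abstract bounds the context the autoregressive model conditions on, so memory use and per-step latency stay constant no matter how long the input text grows. A toy sketch of that idea follows; the `model.predict` interface, the window size, and the stub model are placeholders, not the paper's actual API.

```python
# Minimal sketch of sliding-window autoregressive decoding: each step
# conditions only on the most recent `window` steps, keeping cost
# constant for arbitrarily long inputs. Not the TRiMM architecture.
from collections import deque

def generate_gestures(model, feature_stream, window=8):
    context = deque(maxlen=window)   # oldest entries fall out automatically
    for feats in feature_stream:     # e.g., per-sentence speech/text features
        context.append(feats)
        # Predict the next gesture latent from the bounded window
        # instead of the full history.
        yield model.predict(list(context))

# Toy usage with a stand-in model that just reports its context length.
class StubModel:
    def predict(self, ctx):
        return f"gesture conditioned on {len(ctx)} recent steps"

for out in generate_gestures(StubModel(), range(12), window=4):
    print(out)  # context grows to 4 steps, then stays capped
```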
Problem

Research questions and friction points this paper is trying to address.

Real-time 3D gesture generation for digital humans
Precise temporal alignment between speech and gestures
Long-text comprehension in co-speech gesture synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal attention for speech-gesture alignment
Sliding window autoregressive model for sequences
Large-scale atomic action library for real-time retrieval (see the sketch after this list)
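For the third module, retrieval from a pre-embedded atomic motion library can reduce to a normalized dot-product nearest-neighbor search, which is what makes millisecond-level response plausible. Below is a hedged sketch of that lookup; the library layout, class, and all names are assumptions for illustration, not taken from the released code.

```python
# Sketch of an atomic-motion-library lookup: gesture clips are embedded
# once offline, and at runtime a predicted gesture latent is matched to
# its nearest clip by cosine similarity. Names/layout are hypothetical.
import numpy as np

class AtomicMotionLibrary:
    def __init__(self, clip_embeddings, clip_ids):
        # clip_embeddings: (n_clips, d) array; normalize rows offline so
        # cosine similarity becomes a single matrix-vector product.
        norms = np.linalg.norm(clip_embeddings, axis=1, keepdims=True)
        self.emb = clip_embeddings / norms
        self.ids = clip_ids

    def match(self, query, k=1):
        # Return the k best-matching clip ids and their similarities.
        q = query / np.linalg.norm(query)
        scores = self.emb @ q
        top = np.argsort(scores)[::-1][:k]
        return [(self.ids[i], float(scores[i])) for i in top]

# Usage: retrieve the best-matching clip for a predicted gesture latent.
lib = AtomicMotionLibrary(np.random.randn(1000, 64), list(range(1000)))
best = lib.match(np.random.randn(64))
```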
👥 Authors
Yueqian Guo (Jiangxi University of Finance and Economics, China)
Tianzhao Li (Communication University of China, China)
Xin Lyu (Graduate Student, University of California, Berkeley; interests: pseudorandomness, differential privacy, computational complexity, algorithms)
Jiehaolin Chen (Communication University of China, China)
Zhaohan Wang (Communication University of China, China)
Sirui Xiao (Communication University of China, China)
Yurun Chen (Master's student, Tsinghua University; interests: 3D vision)
Yezi He (Communication University of China, China)
Helin Li (Communication University of China, China)
Fan Zhang (Communication University of Zhejiang, China)