TRiMM: Transformer-Based Rich Motion Matching for Real-Time Multi-Modal Interaction in Digital Humans

📅 2025-06-01
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address the challenges of long-text comprehension and high latency in speech–gesture co-synthesis for LLM-driven digital humans in real-time interaction, this paper proposes an end-to-end real-time speech-to-3D-gesture generation framework. The method introduces three key innovations: (1) a cross-modal temporal alignment attention mechanism enabling fine-grained, dynamic alignment between acoustic features and gesture motions; (2) a sliding-window autoregressive modeling scheme for long-context understanding, supporting text inputs of up to hundreds of words; and (3) a lightweight retrieval-matching module driven by an atomic motion library, ensuring millisecond-level response. Built on a Transformer architecture, the system achieves 120 fps rendering on an RTX 3060 GPU with a sentence-level end-to-end latency of only 0.15 seconds. Comprehensive evaluations on the ZEGGS and BEAT benchmarks show that the system surpasses prior state-of-the-art methods, marking the first demonstration of high-fidelity, low-latency, deployable multimodal real-time interaction.
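As a concrete illustration of module (1), the following is a minimal sketch of what such a cross-modal temporal alignment attention layer could look like, assuming PyTorch, with acoustic frames as queries attending over gesture frames. All module names, dimensions, and the overall layout are illustrative assumptions, not the TRiMM implementation.

```python
# Hypothetical sketch of cross-modal temporal alignment attention:
# audio frames attend over gesture frames, producing a soft alignment
# map between the two modalities. Names/dims are illustrative only.
import torch
import torch.nn as nn

class CrossModalAlignment(nn.Module):
    def __init__(self, audio_dim=128, motion_dim=256, d_model=256, n_heads=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)    # acoustic features -> shared space
        self.motion_proj = nn.Linear(motion_dim, d_model)  # gesture features -> shared space
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio_feats, motion_feats):
        # audio_feats:  (batch, T_audio, audio_dim)
        # motion_feats: (batch, T_motion, motion_dim)
        q = self.audio_proj(audio_feats)
        kv = self.motion_proj(motion_feats)
        # Each audio frame attends over candidate gesture frames,
        # yielding a fine-grained temporal alignment across modalities.
        aligned, attn_weights = self.attn(q, kv, kv, need_weights=True)
        return self.norm(aligned + q), attn_weights

# Usage: align 2 s of 50 Hz acoustic features with 120 gesture frames.
model = CrossModalAlignment()
audio = torch.randn(1, 100, 128)
motion = torch.randn(1, 120, 256)
out, weights = model(audio, motion)  # weights: (1, 100, 120) alignment map
```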

📝 Abstract
Large Language Model (LLM)-driven digital humans have sparked a series of recent studies on co-speech gesture generation systems. However, existing approaches struggle with real-time synthesis and long-text comprehension. This paper introduces Transformer-Based Rich Motion Matching (TRiMM), a novel multi-modal framework for real-time 3D gesture generation. Our method incorporates three modules: 1) a cross-modal attention mechanism to achieve precise temporal alignment between speech and gestures; 2) a long-context autoregressive model with a sliding window mechanism for effective sequence modeling; 3) a large-scale gesture matching system that constructs an atomic action library and enables real-time retrieval. Additionally, we develop a lightweight pipeline implemented in the Unreal Engine for experimentation. Our approach achieves real-time inference at 120 fps and maintains a per-sentence latency of 0.15 seconds on consumer-grade GPUs (GeForce RTX 3060). Extensive subjective and objective evaluations on the ZEGGS and BEAT datasets demonstrate that our model outperforms current state-of-the-art methods. TRiMM enhances the speed of co-speech gesture generation while ensuring gesture quality, enabling LLM-driven digital humans to respond to speech in real time and synthesize corresponding gestures. Our code is available at https://github.com/teroon/TRiMM-Transformer-Based-Rich-Motion-Matching
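The sliding-window mechanism mentioned in the abstract bounds the context the autoregressive model conditions on, so memory use and per-step latency stay constant no matter how long the input text grows. A toy sketch of that idea follows; the `model.predict` interface, the window size, and the stub model are placeholders, not the paper's actual API.

```python
# Minimal sketch of sliding-window autoregressive decoding: each step
# conditions only on the most recent `window` steps, keeping cost
# constant for arbitrarily long inputs. Not the TRiMM architecture.
from collections import deque

def generate_gestures(model, feature_stream, window=8):
    context = deque(maxlen=window)   # oldest entries fall out automatically
    for feats in feature_stream:     # e.g., per-sentence speech/text features
        context.append(feats)
        # Predict the next gesture latent from the bounded window
        # instead of the full history.
        yield model.predict(list(context))

# Toy usage with a stand-in model that just reports its context length.
class StubModel:
    def predict(self, ctx):
        return f"gesture conditioned on {len(ctx)} recent steps"

for out in generate_gestures(StubModel(), range(12), window=4):
    print(out)  # context grows to 4 steps, then stays capped
```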
Problem

Research questions and friction points this paper is trying to address.

Real-time 3D gesture generation for digital humans
Precise temporal alignment between speech and gestures
Long-text comprehension in co-speech gesture synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal attention for speech-gesture alignment
Sliding window autoregressive model for sequences
Large-scale atomic action library for real-time retrieval (see the sketch after this list)
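For the third module, retrieval from a pre-embedded atomic motion library can reduce to a normalized dot-product nearest-neighbor search, which is what makes millisecond-level response plausible. Below is a hedged sketch of that lookup; the library layout, class, and all names are assumptions for illustration, not taken from the released code.

```python
# Sketch of an atomic-motion-library lookup: gesture clips are embedded
# once offline, and at runtime a predicted gesture latent is matched to
# its nearest clip by cosine similarity. Names/layout are hypothetical.
import numpy as np

class AtomicMotionLibrary:
    def __init__(self, clip_embeddings, clip_ids):
        # clip_embeddings: (n_clips, d) array; normalize rows offline so
        # cosine similarity becomes a single matrix-vector product.
        norms = np.linalg.norm(clip_embeddings, axis=1, keepdims=True)
        self.emb = clip_embeddings / norms
        self.ids = clip_ids

    def match(self, query, k=1):
        # Return the k best-matching clip ids and their similarities.
        q = query / np.linalg.norm(query)
        scores = self.emb @ q
        top = np.argsort(scores)[::-1][:k]
        return [(self.ids[i], float(scores[i])) for i in top]

# Usage: retrieve the best-matching clip for a predicted gesture latent.
lib = AtomicMotionLibrary(np.random.randn(1000, 64), list(range(1000)))
best = lib.match(np.random.randn(64))
```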
👥 Authors
Yueqian Guo (Jiangxi University of Finance and Economics, China)
Tianzhao Li (Communication University of China, China)
Xin Lyu (Graduate Student, University of California, Berkeley; interests: pseudorandomness, differential privacy, computational complexity, algorithms)
Jiehaolin Chen (Communication University of China, China)
Zhaohan Wang (Communication University of China, China)
Sirui Xiao (Communication University of China, China)
Yurun Chen (Master's student, Tsinghua University; interests: 3D vision)
Yezi He (Communication University of China, China)
Helin Li (Communication University of China, China)
Fan Zhang (Communication University of Zhejiang, China)