MARRS: Masked Autoregressive Unit-based Reaction Synthesis

📅 2025-05-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the human action–reaction synthesis task—generating temporally coordinated, fine-grained reactive motion (especially hand–body coordination) conditioned on observed partner actions. To overcome the information loss and poor codebook utilization inherent in vector-quantized autoregressive modeling, the authors propose MARRS: a novel framework featuring a Unit-distinguished Motion Variational AutoEncoder (UD-VAE) that learns high-fidelity continuous motion representations by encoding body and hand units independently. MARRS further introduces Action-Conditioned Fusion (ACF) and Adaptive Unit Modulation (AUM) to enable cross-unit interaction and precise hand-motion generation. The method integrates unit-based motion encoding, masked autoregressive modeling, and lightweight MLP-based diffusion noise prediction. Quantitative and qualitative evaluations demonstrate that MARRS outperforms state-of-the-art methods in temporal coherence, anatomical plausibility, and hand-motion fidelity.
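The "lightweight MLP-based diffusion noise prediction" mentioned above can be illustrated with a minimal sketch: a continuous motion token is noised at a random diffusion step, and a compact MLP, conditioned on fused action features, predicts the added noise. All dimensions, the linear beta schedule, and the toy two-layer MLP below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper does not specify these dimensions.
TOKEN_DIM, COND_DIM, HIDDEN, T_STEPS = 16, 32, 64, 1000

# Linear beta schedule (an illustrative choice, not from the paper).
betas = np.linspace(1e-4, 2e-2, T_STEPS)
alpha_bars = np.cumprod(1.0 - betas)

# Compact MLP noise predictor: input = noisy token + condition + timestep.
W1 = rng.normal(0.0, 0.02, (TOKEN_DIM + COND_DIM + 1, HIDDEN))
W2 = rng.normal(0.0, 0.02, (HIDDEN, TOKEN_DIM))

def predict_noise(z_t, cond, t):
    """Tiny 2-layer ReLU MLP standing in for the per-unit noise predictor."""
    x = np.concatenate([z_t, cond, [t / T_STEPS]])
    return np.maximum(x @ W1, 0.0) @ W2

def diffusion_loss(z0, cond):
    """Per-token diffusion loss: noise the continuous token, predict the noise."""
    t = int(rng.integers(0, T_STEPS))
    eps = rng.normal(size=TOKEN_DIM)
    z_t = np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    eps_hat = predict_noise(z_t, cond, t)
    return float(np.mean((eps - eps_hat) ** 2))

# One continuous motion token (e.g. a body-unit latent) and a fused condition.
z0 = rng.normal(size=TOKEN_DIM)
cond = rng.normal(size=COND_DIM)
loss = diffusion_loss(z0, cond)
```

Training this predictor with such a per-token loss is what lets the model keep continuous representations instead of a discrete VQ codebook.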

📝 Abstract
This work aims at a challenging task: human action-reaction synthesis, i.e., generating human reactions conditioned on the action sequence of the other person. Currently, autoregressive modeling approaches have achieved remarkable performance in motion generation tasks, e.g., text-to-motion. However, the vector quantization (VQ) accompanying autoregressive generation has inherent disadvantages, including loss of quantization information and low codebook utilization. Moreover, unlike text-to-motion, which focuses solely on the movement of body joints, human action-reaction synthesis also encompasses fine-grained hand movements. In this work, we propose MARRS, a novel framework designed to generate coordinated and fine-grained reaction motions in continuous representations. Initially, we present the Unit-distinguished Motion Variational AutoEncoder (UD-VAE), which segments the entire body into distinct body and hand units, encoding them independently. Subsequently, we propose Action-Conditioned Fusion (ACF), which involves randomly masking a subset of reactive tokens and extracting specific information about the body and hands from the active tokens. Furthermore, we introduce Adaptive Unit Modulation (AUM) to facilitate interaction between body and hand units by using the information from one unit to adaptively modulate the other. Finally, for the diffusion model, we employ a compact MLP as a noise predictor for each distinct body unit and incorporate the diffusion loss to model the probability distribution of each token. Quantitative and qualitative results demonstrate that our method achieves superior performance. The code will be released upon acceptance.
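The random-masking step that the abstract attributes to Action-Conditioned Fusion (masking a subset of reactive tokens during training) can be sketched as follows. The sequence length, token width, zero-vector mask token, and 50% mask ratio are all illustrative assumptions; in practice the mask token would typically be learnable.

```python
import numpy as np

rng = np.random.default_rng(2)

SEQ, DIM = 10, 4            # toy sequence length / token width (assumed)
MASK_TOKEN = np.zeros(DIM)  # learnable in practice; zeros for illustration

reactive = rng.normal(size=(SEQ, DIM))  # continuous reactive-motion tokens

# Randomly mask a subset of reactive tokens (masked autoregressive training):
mask_ratio = 0.5
n_mask = int(SEQ * mask_ratio)
masked_idx = rng.choice(SEQ, size=n_mask, replace=False)

inputs = reactive.copy()
inputs[masked_idx] = MASK_TOKEN  # the model must reconstruct these positions
targets = reactive[masked_idx]   # supervision only on the masked positions
```

The model then attends over the visible reactive tokens and the partner's action tokens to fill in the masked positions.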
Problem

Research questions and friction points this paper is trying to address.

Generating human reactions from action sequences with fine-grained hand movements
Overcoming limitations of vector quantization in autoregressive motion generation
Coordinating body and hand units for realistic reaction synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unit-distinguished Motion VAE for independent encoding
Action-Conditioned Fusion with random token masking
Adaptive Unit Modulation for inter-unit interaction
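One plausible reading of Adaptive Unit Modulation, assuming an AdaLN-style scale-and-shift mechanism (the page does not give the exact formulation): features of one unit are normalized and then modulated by scale and shift vectors computed from the other unit. Everything below, including the feature width and linear modulation maps, is a toy assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

DIM = 8  # hypothetical feature width per unit

# Linear maps producing scale/shift from the *other* unit's features
# (an AdaLN-style interpretation; details assumed).
W_scale = rng.normal(0.0, 0.1, (DIM, DIM))
W_shift = rng.normal(0.0, 0.1, (DIM, DIM))

def modulate(target, source):
    """Modulate `target` unit features using statistics from `source`."""
    normed = (target - target.mean()) / (target.std() + 1e-6)
    scale = 1.0 + source @ W_scale  # start near the identity mapping
    shift = source @ W_shift
    return normed * scale + shift

body = rng.normal(size=DIM)  # body-unit token features (toy)
hand = rng.normal(size=DIM)  # hand-unit token features (toy)

# Bidirectional modulation: each unit conditions the other.
body_out = modulate(body, hand)
hand_out = modulate(hand, body)
```

Initializing the scale near 1 and the shift near 0 keeps the modulation close to the identity early in training, a common choice for adaptive-normalization layers.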
Y. B. Wang
Zhejiang University
S. Wang
Youtu Lab, Tencent
J. N. Zhang
Youtu Lab, Tencent
J. F. Wu
Youtu Lab, Tencent
Q. D. He
Youtu Lab, Tencent
C. C. Fu
Youtu Lab, Tencent
C. J. Wang
Youtu Lab, Tencent
Y. Liu
School of Electric Power Engineering, South China University of Technology