Dyadic Mamba: Long-term Dyadic Human Motion Synthesis

📅 2025-05-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-driven two-person motion generation methods, particularly Transformer-based approaches, are constrained by their positional encoding schemes, which limits their ability to model long-horizon interactive dynamics. This work proposes the first State Space Model (SSM)-based framework for long-horizon two-person motion synthesis: it replaces cross-attention with a sequence-concatenation-based co-modeling mechanism for efficient inter-person motion coupling, and integrates a joint text-motion embedding with self-supervised motion-prior learning to improve semantic alignment and temporal coherence. It also introduces the first benchmark dedicated to long-horizon two-person motion synthesis. Experiments show that the method is competitive on short-horizon tasks (<5 seconds), significantly outperforms state-of-the-art Transformers on long-horizon sequences (>10 seconds), and supports stable generation of arbitrary length.
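The core architectural idea, as the summary describes it, is to couple the two persons' motions by concatenating their sequences and running a single shared SSM over the result, rather than wiring them together with cross-attention. A minimal NumPy sketch of that pattern, using a toy diagonal linear state-space scan as a stand-in for a real Mamba block (all function names, shapes, and parameters here are illustrative assumptions, not the paper's actual implementation):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Toy diagonal linear SSM: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t.
    x: (T, D) sequence; A, B, C: (D,) per-channel parameters."""
    T, D = x.shape
    h = np.zeros(D)
    y = np.empty_like(x)
    for t in range(T):
        h = A * h + B * x[t]   # recurrent state update
        y[t] = C * h           # per-step readout
    return y

def dyadic_block(motion_a, motion_b, A, B, C):
    """Concatenation-based co-modeling: stack both persons' feature
    sequences along time, run one shared SSM scan so information flows
    between them through the recurrent state, then split the output
    back into per-person sequences (no cross-attention needed)."""
    T = motion_a.shape[0]
    joint = np.concatenate([motion_a, motion_b], axis=0)  # (2T, D)
    out = ssm_scan(joint, A, B, C)
    return out[:T], out[T:]

rng = np.random.default_rng(0)
T, D = 8, 4
a = rng.normal(size=(T, D))
b = rng.normal(size=(T, D))
A = np.full(D, 0.9)
Bp = np.ones(D)
Cp = np.ones(D)
out_a, out_b = dyadic_block(a, b, A, Bp, Cp)
print(out_a.shape, out_b.shape)  # each (8, 4)
```

Because the recurrent state carries over from person A's segment into person B's, B's output depends on A's motion, which is the inter-person coupling that the concatenation trick buys without any attention mechanism.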

📝 Abstract
Generating realistic dyadic human motion from text descriptions presents significant challenges, particularly for extended interactions that exceed typical training sequence lengths. While recent transformer-based approaches have shown promising results for short-term dyadic motion synthesis, they struggle with longer sequences due to inherent limitations in positional encoding schemes. In this paper, we introduce Dyadic Mamba, a novel approach that leverages State-Space Models (SSMs) to generate high-quality dyadic human motion of arbitrary length. Our method employs a simple yet effective architecture that facilitates information flow between individual motion sequences through concatenation, eliminating the need for complex cross-attention mechanisms. We demonstrate that Dyadic Mamba achieves competitive performance on standard short-term benchmarks while significantly outperforming transformer-based approaches on longer sequences. Additionally, we propose a new benchmark for evaluating long-term motion synthesis quality, providing a standardized framework for future research. Our results demonstrate that SSM-based architectures offer a promising direction for addressing the challenging task of long-term dyadic human motion synthesis from text descriptions.
Problem

Research questions and friction points this paper is trying to address.

Generating realistic long-term dyadic human motion from text
Overcoming transformer limitations in extended interaction sequences
Proposing SSM-based architecture for arbitrary-length motion synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses State-Space Models for long-term motion synthesis
Simplifies architecture with concatenation-based information flow
Introduces new benchmark for long-term motion evaluation