🤖 AI Summary
Existing long-horizon pose generation methods rely on intermediate representations, either through two-stage quantization pipelines or autoregressive models, which leads to error accumulation and temporal incoherence during inference. This paper introduces the first end-to-end, single-pass, RGB- and text-driven framework for 3D pose generation, directly predicting long-sequence 3D poses in continuous coordinate space. Key contributions include: (1) a relative motion prediction mechanism that mitigates absolute coordinate drift; (2) unified placeholder tokens and relative positional encodings that keep model behavior consistent between training and inference; and (3) a lightweight context fusion module that replaces tokenization and multi-stage modeling. The method achieves significant improvements over quantized and autoregressive baselines on Penn Action and F-PHAB, demonstrating superior temporal coherence and action plausibility in long-term generation tasks.
📝 Abstract
Current approaches to pose generation rely heavily on intermediate representations, either through two-stage pipelines with quantization or through autoregressive models that accumulate errors during inference. This fundamental limitation degrades performance, particularly in long-term pose generation, where maintaining temporal coherence is crucial. We propose a novel one-stage architecture that directly generates poses in continuous coordinate space from minimal context (a single RGB image and a text description) while maintaining consistent distributions between training and inference. Our key innovation is eliminating the need for intermediate representations or token-based generation by operating directly on pose coordinates. Two components make this possible: a relative movement prediction mechanism that preserves spatial relationships, and a unified placeholder token approach that enables single-forward generation with identical behavior during training and inference. Through extensive experiments on the Penn Action and First-Person Hand Action Benchmark (F-PHAB) datasets, we demonstrate that our approach significantly outperforms existing quantization-based and autoregressive methods, especially in long-term generation scenarios.
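To make the idea of relative movement prediction concrete, the sketch below shows one possible reading of it: the model outputs per-frame joint displacements, and absolute poses are recovered by accumulating them from a known starting pose. The abstract does not say whether movement is measured against the previous frame or the initial pose, so the previous-frame convention here is an assumption, and the function names `to_relative` and `to_absolute` are hypothetical rather than taken from the paper.

```python
def to_relative(poses):
    """Convert absolute poses to per-frame displacements.

    poses: list of frames, each a list of joint coordinate tuples.
    Returns one displacement frame per consecutive pair of poses.
    Assumption: movement is relative to the previous frame.
    """
    return [
        [tuple(c - p for p, c in zip(prev_joint, cur_joint))
         for prev_joint, cur_joint in zip(prev, cur)]
        for prev, cur in zip(poses, poses[1:])
    ]


def to_absolute(initial_pose, deltas):
    """Recover absolute poses by accumulating displacements from a start pose."""
    poses = [[tuple(joint) for joint in initial_pose]]
    for frame_delta in deltas:
        prev = poses[-1]
        poses.append([
            tuple(p + d for p, d in zip(joint, joint_delta))
            for joint, joint_delta in zip(prev, frame_delta)
        ])
    return poses


# Two joints in 2D, moved over two frames (toy values, not real data).
initial = [(0.0, 0.0), (1.0, 1.0)]
deltas = [
    [(1.0, 0.0), (0.0, 1.0)],
    [(1.0, 0.0), (0.0, 1.0)],
]
poses = to_absolute(initial, deltas)
```

In a single-forward setting, all displacement frames would be predicted at once and then accumulated, so the learning targets stay small and centered regardless of where the subject sits in the image.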