Towards Consistent Long-Term Pose Generation

📅 2025-07-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing long-horizon pose generation methods rely on intermediate representations, through either quantization-based two-stage pipelines or autoregressive modeling, which leads to error accumulation and temporal incoherence during inference. This paper introduces the first end-to-end, single-pass, RGB- and text-driven framework for 3D pose generation, directly predicting long-sequence 3D poses in continuous coordinate space. Key contributions include: (1) a relative motion prediction mechanism that mitigates absolute coordinate drift; (2) unified placeholder tokens and relative positional encodings that keep training and inference behavior consistent; and (3) a lightweight context fusion module that replaces tokenization and multi-stage modeling. The method significantly outperforms quantized and autoregressive baselines on Penn Action and F-PHAB, demonstrating superior temporal coherence and action plausibility in long-term generation tasks.

📝 Abstract

Current approaches to pose generation rely heavily on intermediate representations, either through two-stage pipelines with quantization or autoregressive models that accumulate errors during inference. This fundamental limitation leads to degraded performance, particularly in long-term pose generation where maintaining temporal coherence is crucial. We propose a novel one-stage architecture that directly generates poses in continuous coordinate space from minimal context (a single RGB image and a text description) while maintaining consistent distributions between training and inference. Our key innovation is eliminating the need for intermediate representations or token-based generation by operating directly on pose coordinates through a relative movement prediction mechanism that preserves spatial relationships, and a unified placeholder token approach that enables single-forward generation with identical behavior during training and inference. Through extensive experiments on Penn Action and First-Person Hand Action Benchmark (F-PHAB) datasets, we demonstrate that our approach significantly outperforms existing quantization-based and autoregressive methods, especially in long-term generation scenarios.
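
To make the idea of relative movement prediction concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the abstract does not say what the movements are measured against, so the sketch assumes offsets from a fixed reference pose (here, the first frame). The function names `to_relative` and `to_absolute` and the `(frames, joints, dims)` array layout are hypothetical.

```python
import numpy as np

def to_relative(poses):
    """Encode absolute poses as offsets from the first frame.

    poses: array of shape (T, J, D) = frames, joints, coordinate dims.
    Returns an array of shape (T - 1, J, D).
    Assumption: the first frame serves as the fixed reference pose.
    """
    poses = np.asarray(poses, dtype=float)
    return poses[1:] - poses[0]

def to_absolute(reference_pose, offsets):
    """Recover absolute poses by adding each offset to the reference pose."""
    reference_pose = np.asarray(reference_pose, dtype=float)
    return reference_pose + np.asarray(offsets, dtype=float)

# Round trip on random data: 4 frames, 3 joints, 2D coordinates.
rng = np.random.default_rng(0)
poses = rng.normal(size=(4, 3, 2))
offsets = to_relative(poses)
recovered = to_absolute(poses[0], offsets)
```

Under this assumption every frame is reconstructed from the same reference, so an error in one predicted offset stays in that frame instead of propagating to later ones, as it would with a running sum of frame-to-frame deltas.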

Problem

Research questions and friction points this paper is trying to address.

Eliminate error accumulation in long-term pose generation

Maintain temporal coherence without intermediate representations

Generate poses directly from minimal RGB and text input

Innovation

Methods, ideas, or system contributions that make the work stand out.

One-stage architecture for direct pose generation

Relative movement prediction preserves spatial relationships

Unified placeholder token enables single-forward generation
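
The placeholder-token idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' architecture: the helper names, the sinusoidal form of the positional encoding, and the `model` callable are all hypothetical. One shared placeholder vector fills every future-frame slot, the slots differ only by an encoding of their index relative to the first slot, and the whole sequence goes through the model in a single call.

```python
import numpy as np

def build_query_sequence(context, placeholder, num_frames):
    """Stack context tokens with one shared placeholder per future frame.

    context:     (C, d) array of fused image/text features (hypothetical).
    placeholder: (d,) vector shared by every frame slot.
    Returns a (C + num_frames, d) array.
    Assumption: slots are distinguished by a sinusoidal encoding of their
    index relative to the first slot.
    """
    d = placeholder.shape[0]
    steps = np.arange(num_frames)[:, None]                 # (T, 1)
    freqs = 1.0 / (10000.0 ** (np.arange(0, d, 2) / d))    # one per sin column
    enc = np.zeros((num_frames, d))
    enc[:, 0::2] = np.sin(steps * freqs)
    enc[:, 1::2] = np.cos(steps * freqs[: d // 2])
    slots = placeholder[None, :] + enc
    return np.concatenate([context, slots], axis=0)

def generate_in_one_pass(model, context, placeholder, num_frames):
    """Run the model once and read the outputs at the frame slots."""
    tokens = build_query_sequence(context, placeholder, num_frames)
    outputs = model(tokens)          # single forward call, no sampling loop
    return outputs[-num_frames:]

# Toy usage with a stand-in "model" that just doubles its input.
context = np.ones((2, 8))
placeholder = np.zeros(8)
sequence = build_query_sequence(context, placeholder, 5)
frames = generate_in_one_pass(lambda x: x * 2.0, context, placeholder, 5)
```

Because the input is built the same way whether or not ground-truth poses are available, the model sees the same kind of sequence in training and at inference, which is the consistency property the summary attributes to this design.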
