🤖 AI Summary
Existing text-to-3D human pose generation methods rely heavily on low-level, joint-specific prompts, which limits their ability to interpret abstract, semantically rich natural language descriptions and hinders practical applicability. To address this, we propose a novel Chain-of-Thought (CoT) reasoning paradigm that enables progressive decoding from high-level action intent to low-level skeletal configurations. To support training, we introduce the first triplet-based data synthesis pipeline tailored to abstract language: (abstract prompt → detailed prompt → 3D pose), integrating multimodal large language models with structured reasoning modules. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches under abstract textual input, generating poses that exhibit both semantic fidelity and kinematic plausibility. Crucially, it bridges the gap between natural-language semantics and geometric 3D modeling, advancing the alignment of linguistic abstraction with biomechanically grounded pose synthesis.
📝 Abstract
Recent advances in multi-modal large language models (MLLMs) and chain-of-thought (CoT) reasoning have led to significant progress in image and text generation tasks. However, the field of 3D human pose generation still faces critical limitations. Most existing text-to-pose models rely heavily on detailed (low-level) prompts that explicitly describe joint configurations. In contrast, humans tend to communicate actions and intentions using abstract (high-level) language. This mismatch poses a practical challenge for deploying pose generation systems in real-world scenarios. To bridge this gap, we introduce a novel framework that incorporates CoT reasoning into the pose generation process, enabling the interpretation of abstract prompts into accurate 3D human poses. We further propose a data synthesis pipeline that automatically generates triplets of abstract prompts, detailed prompts, and corresponding 3D poses for training. Experimental results demonstrate that our reasoning-enhanced model, CoT-Pose, can effectively generate plausible and semantically aligned poses from abstract textual inputs. This work highlights the importance of high-level understanding in pose generation and opens new directions for reasoning-enhanced approaches to human pose generation.
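The triplet structure at the heart of the data synthesis pipeline can be sketched as a simple data type plus a two-stage generation step. This is a minimal illustration, not the paper's implementation: the names `PoseTriplet` and `synthesize_triplet` are hypothetical, the MLLM and pose decoder are replaced by placeholder stubs, and the 22-joint skeleton size is only an assumed example.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PoseTriplet:
    """One training sample: abstract prompt -> detailed prompt -> 3D pose.

    Mirrors the triplet format described in the abstract; field names
    and the joint count are illustrative assumptions.
    """
    abstract_prompt: str   # high-level intent, e.g. "a person celebrating"
    detailed_prompt: str   # low-level, joint-specific description
    pose: List[Tuple[float, float, float]]  # per-joint 3D coordinates

def synthesize_triplet(abstract_prompt: str, num_joints: int = 22) -> PoseTriplet:
    # Placeholder stand-ins for the two pipeline stages: in the paper,
    # an MLLM would expand the abstract prompt into a detailed one, and
    # a text-to-pose model would decode the detailed prompt into a pose.
    detailed_prompt = f"Detailed joint description derived from: {abstract_prompt}"
    pose = [(0.0, 0.0, 0.0)] * num_joints  # zero-initialized skeleton stub
    return PoseTriplet(abstract_prompt, detailed_prompt, pose)

triplet = synthesize_triplet("a person celebrating a victory")
print(len(triplet.pose))
```

The point of the intermediate `detailed_prompt` field is that it makes the CoT reasoning step supervisable: the model can be trained to produce the detailed description first, then the pose, rather than jumping directly from abstract text to joints.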