🤖 AI Summary
Existing text-to-3D human pose generation methods rely heavily on low-level, joint-specific prompts, which limits their ability to interpret abstract, semantically rich natural language descriptions and hinders practical applicability. To address this, we propose a novel Chain-of-Thought (CoT) reasoning paradigm that enables progressive decoding from high-level action intent to low-level skeletal configurations. To support training, we introduce the first triplet-based data synthesis pipeline tailored to abstract language: (abstract prompt → detailed prompt → 3D pose), integrating multimodal large language models with structured reasoning modules. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches under abstract textual input, generating poses that exhibit both semantic fidelity and kinematic plausibility. Crucially, it bridges the gap between natural-language semantics and geometric 3D modeling, advancing the alignment of linguistic abstraction with biomechanically grounded pose synthesis.
📝 Abstract
Recent advances in multi-modal large language models (MLLMs) and chain-of-thought (CoT) reasoning have led to significant progress in image and text generation tasks. However, the field of 3D human pose generation still faces critical limitations. Most existing text-to-pose models rely heavily on detailed (low-level) prompts that explicitly describe joint configurations. In contrast, humans tend to communicate actions and intentions using abstract (high-level) language. This mismatch poses a practical challenge for deploying pose generation systems in real-world scenarios. To bridge this gap, we introduce a novel framework that incorporates CoT reasoning into the pose generation process, enabling the interpretation of abstract prompts into accurate 3D human poses. We further propose a data synthesis pipeline that automatically generates triplets of abstract prompts, detailed prompts, and corresponding 3D poses for training. Experimental results demonstrate that our reasoning-enhanced model, CoT-Pose, can effectively generate plausible and semantically aligned poses from abstract textual inputs. This work highlights the importance of high-level understanding in pose generation and opens new directions for reasoning-enhanced approaches to human pose generation.
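The triplet structure at the heart of the data synthesis pipeline can be sketched as a simple data type plus a two-stage generation step. This is a minimal illustration, not the paper's implementation: the names `PoseTriplet` and `synthesize_triplet` are hypothetical, the MLLM and pose decoder are replaced by placeholder stubs, and the 22-joint skeleton size is only an assumed example.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PoseTriplet:
    """One training sample: abstract prompt -> detailed prompt -> 3D pose.

    Mirrors the triplet format described in the abstract; field names
    and the joint count are illustrative assumptions.
    """
    abstract_prompt: str   # high-level intent, e.g. "a person celebrating"
    detailed_prompt: str   # low-level, joint-specific description
    pose: List[Tuple[float, float, float]]  # per-joint 3D coordinates

def synthesize_triplet(abstract_prompt: str, num_joints: int = 22) -> PoseTriplet:
    # Placeholder stand-ins for the two pipeline stages: in the paper,
    # an MLLM would expand the abstract prompt into a detailed one, and
    # a text-to-pose model would decode the detailed prompt into a pose.
    detailed_prompt = f"Detailed joint description derived from: {abstract_prompt}"
    pose = [(0.0, 0.0, 0.0)] * num_joints  # zero-initialized skeleton stub
    return PoseTriplet(abstract_prompt, detailed_prompt, pose)

triplet = synthesize_triplet("a person celebrating a victory")
print(len(triplet.pose))
```

The point of the intermediate `detailed_prompt` field is that it makes the CoT reasoning step supervisable: the model can be trained to produce the detailed description first, then the pose, rather than jumping directly from abstract text to joints.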