Articulate3D: Zero-Shot Text-Driven 3D Object Posing

📅 2025-08-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the challenge of end-to-end mapping from natural language instructions to precise 3D object pose control. Methodologically, it introduces RSActrl—a training-free self-attention rewiring mechanism that disentangles structural and pose representations in source images—and replaces differentiable rendering with keypoint-based multi-view pose alignment optimization. Conditional image synthesis is performed using a powerful multimodal generative model. The core contribution is the first zero-shot, text-controllable 3D pose editing framework, eliminating the need for explicit 3D modeling, 3D supervision, or model fine-tuning. Extensive evaluation across diverse object categories and open-ended text prompts demonstrates robust performance. A user study shows that over 85% of edited results are rated superior to those produced by state-of-the-art methods, significantly improving both controllability and visual fidelity.

📝 Abstract
We propose a training-free method, Articulate3D, to pose a 3D asset through language control. Despite advances in vision and language models, this task remains surprisingly challenging. To achieve this goal, we decompose the problem into two steps. We modify a powerful image generator to create target images conditioned on the input image and a text instruction. We then align the mesh to the target images through a multi-view pose optimisation step. In detail, we introduce a self-attention rewiring mechanism (RSActrl) that decouples the source structure from pose within an image generative model, allowing it to maintain a consistent structure across varying poses. We observed that differentiable rendering is an unreliable signal for articulation optimisation; instead, we use keypoints to establish correspondences between input and target images. The effectiveness of Articulate3D is demonstrated across a diverse range of 3D objects and free-form text prompts, successfully manipulating poses while maintaining the original identity of the mesh. Quantitative evaluations and a comparative user study, in which our method was preferred over 85% of the time, confirm its superiority over existing approaches. Project page: https://odeb1.github.io/articulate3d_page_deb/
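The second step described above can be illustrated with a minimal sketch: given target 2D keypoint locations in several views, articulation parameters are fit by minimising reprojection error rather than a differentiable-rendering loss. Everything here (a single z-axis joint, the camera setup, the names `limb_keypoints` and `camera`) is an illustrative assumption for the sketch, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import least_squares

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def camera(a, dist=5.0):
    # Toy camera: rotated about y by angle a, pushed back along z,
    # with identity (normalised) intrinsics.
    R = rot_y(a)
    t = np.array([0.0, 0.0, dist])
    return np.hstack([R, t[:, None]])

def project(P, X):
    # Pinhole projection of 3D points X (N,3) with a 3x4 matrix P.
    Xh = np.hstack([X, np.ones((len(X), 1))])
    x = (P @ Xh.T).T
    return x[:, :2] / x[:, 2:3]

def limb_keypoints(theta):
    # A single joint at the origin; a limb tip at (1,0,0) articulated
    # by rotating theta radians about the z-axis.
    tip = rot_z(theta) @ np.array([1.0, 0.0, 0.0])
    return np.vstack([[0.0, 0.0, 0.0], tip])

# Target 2D keypoints in three views, produced here from a known pose;
# in the actual pipeline they would come from the generated images.
views = [camera(a) for a in (0.0, 0.6, -0.6)]
theta_true = 0.7
targets = [project(P, limb_keypoints(theta_true)) for P in views]

def residuals(params):
    # Stacked reprojection errors over all views and keypoints.
    kps = limb_keypoints(params[0])
    return np.concatenate(
        [(project(P, kps) - tgt).ravel() for P, tgt in zip(views, targets)]
    )

sol = least_squares(residuals, x0=[0.0])
print(round(sol.x[0], 3))  # should recover theta_true ≈ 0.7
```

With more joints, `params` simply grows to a vector of joint angles and `limb_keypoints` walks the kinematic chain; the least-squares structure is unchanged.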
Problem

Research questions and friction points this paper is trying to address.

Posing 3D objects through language control without training
Decoupling source structure from pose in image generation
Optimizing mesh alignment using keypoints instead of differentiable rendering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free method for 3D posing
Self-attention rewiring mechanism (RSActrl)
Keypoint-based multi-view pose optimization
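The general idea behind rewiring self-attention to preserve source structure can be sketched in a few lines: the target pass keeps its own queries but attends to keys and values computed from the source image's features, so target tokens copy appearance from the source while spatial layout follows the target. This is a generic single-head numpy illustration under assumed shapes, not the paper's RSActrl.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention for one head.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

rng = np.random.default_rng(0)
d = 8
# Stand-ins for token features from the source-image pass and the
# target (pose-edited) pass of the generator.
src = rng.normal(size=(16, d))
tgt = rng.normal(size=(16, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Standard self-attention: queries, keys, values all from the target.
standard = attention(tgt @ Wq, tgt @ Wk, tgt @ Wv)

# Rewired self-attention: target queries attend to source keys/values,
# injecting the source's structure into the target generation.
rewired = attention(tgt @ Wq, src @ Wk, src @ Wv)
```

In a real diffusion backbone this swap would be applied inside selected self-attention layers at each denoising step, with the source features cached from a reconstruction pass.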