🤖 AI Summary
This paper addresses the challenge of end-to-end mapping from natural language instructions to precise 3D object pose control. Methodologically, it introduces RSActrl, a training-free self-attention rewiring mechanism that disentangles structural and pose representations in source images, and replaces differentiable rendering with keypoint-based multi-view pose-alignment optimization. Conditional image synthesis is performed with a powerful multimodal generative model. The core contribution is the first zero-shot, text-controllable 3D pose-editing framework, eliminating the need for explicit 3D modeling, 3D supervision, or model fine-tuning. Extensive evaluation across diverse object categories and open-ended text prompts demonstrates robust performance, and in a user study the method's results were preferred over those of state-of-the-art approaches more than 85% of the time, confirming gains in both controllability and visual fidelity.
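The rewiring idea described above can be illustrated with a minimal sketch: during generation of the pose-edited image, the queries of the target branch attend to keys and values cached from the source branch, so the output inherits the source structure. This is a simplified stand-in (plain NumPy single-head attention, random features), not the paper's actual RSActrl implementation; all tensor names and sizes here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention over the token dimension
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

rng = np.random.default_rng(0)
n_tokens, d = 16, 8
q_tgt = rng.standard_normal((n_tokens, d))  # queries from the pose-edited (target) branch
k_src = rng.standard_normal((n_tokens, d))  # keys cached from the source-image branch
v_src = rng.standard_normal((n_tokens, d))  # values cached from the source-image branch

# Rewired step: instead of self-attending within the target branch
# (attention(q_tgt, k_tgt, v_tgt)), the target queries attend to the
# source keys/values, injecting the source structure into the new pose.
out = attention(q_tgt, k_src, v_src)
print(out.shape)  # (16, 8)
```

In a real diffusion U-Net this substitution would be applied inside selected self-attention layers at each denoising step; the sketch only shows the core key/value swap.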
📝 Abstract
We propose a training-free method, Articulate3D, to pose a 3D asset through language control. Despite advances in vision and language models, this task remains surprisingly challenging. To achieve this goal, we decompose the problem into two steps. We modify a powerful image generator to create target images conditioned on the input image and a text instruction. We then align the mesh to the target images through a multi-view pose optimisation step. In detail, we introduce a self-attention rewiring mechanism (RSActrl) that decouples the source structure from pose within an image generative model, allowing it to maintain a consistent structure across varying poses. We observe that differentiable rendering is an unreliable signal for articulation optimisation; instead, we use keypoints to establish correspondences between input and target images. The effectiveness of Articulate3D is demonstrated across a diverse range of 3D objects and free-form text prompts, successfully manipulating poses while maintaining the original identity of the mesh. Quantitative evaluations and a comparative user study, in which our method was preferred over 85% of the time, confirm its superiority over existing approaches. Project page: https://odeb1.github.io/articulate3d_page_deb/
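The second step, keypoint-based pose alignment, can be sketched in miniature: search for the articulation parameter (here a single 2D rotation angle about a joint pivot) that minimises the squared distance between transformed source keypoints and the keypoints detected in the target image. The pivot, keypoints, and ground-truth angle below are invented for illustration; the paper's optimisation runs over multiple views and mesh articulation parameters rather than this toy 2D case.

```python
import numpy as np

def rotate(p, pivot, theta):
    # rotate 2D keypoints p about a joint pivot by angle theta
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    return (p - pivot) @ R.T + pivot

# Hypothetical setup: one articulated part with 3 keypoints,
# ground-truth articulation angle of 0.7 rad.
pivot = np.array([0.0, 0.0])
src_kp = np.array([[1.0, 0.0], [2.0, 0.5], [1.5, -0.5]])
tgt_kp = rotate(src_kp, pivot, 0.7)  # keypoints found in the generated target image

# Grid-search the angle minimising the keypoint correspondence error.
thetas = np.linspace(-np.pi, np.pi, 2001)
errs = [np.sum((rotate(src_kp, pivot, t) - tgt_kp) ** 2) for t in thetas]
best = thetas[int(np.argmin(errs))]
print(best)  # close to 0.7
```

Because the objective depends only on keypoint positions, it avoids the unreliable gradients the abstract attributes to differentiable rendering; in practice a continuous optimiser over all joint parameters would replace the grid search.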