🤖 AI Summary
To address the high cost, labor intensity, and low efficiency of manually modeling interactive 3D objects for AR/VR, animation, and robotics, this paper introduces Articulate-Anything, a code-generation framework driven by vision-language models (VLMs) that automatically constructs interactive, articulated 3D digital twins from text, images, or videos. The method combines VLM understanding of these inputs, mesh retrieval from existing 3D asset datasets, and an actor-critic loop that iteratively proposes, evaluates, and refines articulation solutions, self-correcting errors before the generated code is compiled into a digital twin for standard 3D simulators. On PartNet-Mobility, the approach achieves a 75% success rate, compared with 8.7-11.6% for prior work. Assets generated from in-the-wild videos are further used to train robotic policies for fine-grained manipulation in simulation, and these policies are then transferred to a real robotic system.
📝 Abstract
Interactive 3D simulated objects are crucial in AR/VR, animations, and robotics, driving immersive experiences and advanced automation. However, creating these articulated objects requires extensive human effort and expertise, limiting their broader applications. To overcome this challenge, we present Articulate-Anything, a system that automates the articulation of diverse, complex objects from many input modalities, including text, images, and videos. Articulate-Anything leverages vision-language models (VLMs) to generate code that can be compiled into an interactable digital twin for use in standard 3D simulators. Our system exploits existing 3D asset datasets via a mesh retrieval mechanism, along with an actor-critic system that iteratively proposes, evaluates, and refines solutions for articulating the objects, self-correcting errors to achieve a robust outcome. Qualitative evaluations demonstrate Articulate-Anything's capability to articulate complex and even ambiguous object affordances by leveraging rich grounded inputs. In extensive quantitative experiments on the standard PartNet-Mobility dataset, Articulate-Anything substantially outperforms prior work, increasing the success rate from 8.7-11.6% to 75% and setting a new bar for state-of-the-art performance. We further showcase the utility of our system by generating 3D assets from in-the-wild video inputs, which are then used to train robotic policies for fine-grained manipulation tasks in simulation that go beyond basic pick and place. These policies are then transferred to a real robotic system.
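The propose, evaluate, and refine cycle described in the abstract can be illustrated with a minimal Python sketch. This is not the Articulate-Anything implementation: the names `JointProposal`, `Critique`, `refine`, `toy_actor`, and `toy_critic` are hypothetical, and the VLM actor and critic are replaced here by hand-written stand-ins so that the loop runs without any model or simulator.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class JointProposal:
    """Hypothetical articulation guess for one movable part."""
    joint_type: str                      # e.g. "revolute" or "prismatic"
    axis: Tuple[float, float, float]     # joint axis in the part frame


@dataclass(frozen=True)
class Critique:
    """Hypothetical critic output: a score in [0, 1] plus textual feedback."""
    score: float
    feedback: str


Actor = Callable[[Optional[JointProposal], Optional[str]], JointProposal]
Critic = Callable[[JointProposal], Critique]


def refine(actor: Actor, critic: Critic,
           max_iters: int = 5, threshold: float = 0.9):
    """Iteratively propose, evaluate, and refine; return the best proposal seen."""
    best, best_score = None, -1.0
    previous, feedback = None, None
    for _ in range(max_iters):
        proposal = actor(previous, feedback)   # actor proposes (or revises)
        critique = critic(proposal)            # critic evaluates the proposal
        if critique.score > best_score:
            best, best_score = proposal, critique.score
        if critique.score >= threshold:        # good enough: stop early
            break
        previous, feedback = proposal, critique.feedback
    return best, best_score


# Stand-ins for the VLM actor and critic (assumptions, for illustration only).
TARGET = JointProposal("revolute", (0.0, 0.0, 1.0))


def toy_actor(previous, feedback):
    if previous is None:
        return JointProposal("prismatic", (1.0, 0.0, 0.0))  # deliberately wrong first guess
    if feedback == "wrong joint type":
        return JointProposal("revolute", previous.axis)
    if feedback == "wrong axis":
        return JointProposal(previous.joint_type, (0.0, 0.0, 1.0))
    return previous


def toy_critic(proposal):
    if proposal.joint_type != TARGET.joint_type:
        return Critique(0.0, "wrong joint type")
    if proposal.axis != TARGET.axis:
        return Critique(0.5, "wrong axis")
    return Critique(1.0, "ok")


best, score = refine(toy_actor, toy_critic)
print(best, score)
```

With these stand-ins the loop corrects the joint type on the second iteration and the axis on the third, then stops because the critic's score reaches the threshold. In a real system the critic's feedback would presumably come from comparing the rendered articulation against the input, which this sketch does not attempt to model.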