Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model

📅 2024-10-03
🏛️ arXiv.org
📈 Citations: 1 (influential: 0)
🤖 AI Summary
To reduce the cost and labor of manually modeling interactive 3D objects for AR/VR, animation, and robotics, this paper introduces Articulate-Anything, a code-generation framework driven by vision-language models (VLMs) that automatically constructs interactive, articulated 3D digital twins from text, images, or videos. The method combines cross-modal VLM understanding, mesh retrieval from existing 3D asset datasets, and an actor-critic loop that iteratively proposes, evaluates, and refines articulation code, self-correcting errors before compiling the result into a simulator-ready digital twin. Evaluated on PartNet-Mobility, the approach achieves a 75% success rate, far surpassing the prior state of the art (8.7-11.6%), and the generated assets are used to train fine-grained robotic manipulation policies in simulation that transfer to a physical robot.

📝 Abstract
Interactive 3D simulated objects are crucial in AR/VR, animations, and robotics, driving immersive experiences and advanced automation. However, creating these articulated objects requires extensive human effort and expertise, limiting their broader applications. To overcome this challenge, we present Articulate-Anything, a system that automates the articulation of diverse, complex objects from many input modalities, including text, images, and videos. Articulate-Anything leverages vision-language models (VLMs) to generate code that can be compiled into an interactable digital twin for use in standard 3D simulators. Our system exploits existing 3D asset datasets via a mesh retrieval mechanism, along with an actor-critic system that iteratively proposes, evaluates, and refines solutions for articulating the objects, self-correcting errors to achieve a robust outcome. Qualitative evaluations demonstrate Articulate-Anything's capability to articulate complex and even ambiguous object affordances by leveraging rich grounded inputs. In extensive quantitative experiments on the standard PartNet-Mobility dataset, Articulate-Anything substantially outperforms prior work, increasing the success rate from 8.7-11.6% to 75% and setting a new bar for state-of-the-art performance. We further showcase the utility of our system by generating 3D assets from in-the-wild video inputs, which are then used to train robotic policies for fine-grained manipulation tasks in simulation that go beyond basic pick and place. These policies are then transferred to a real robotic system.
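
The actor-critic refinement described in the abstract can be pictured with a minimal, purely illustrative sketch. In the Python below, `vlm_propose`, `compile_to_twin`, and `vlm_critique` are hypothetical stand-ins (stubbed so the snippet runs), not the paper's actual API; only the propose-evaluate-refine control flow mirrors the system the abstract describes.

```python
from dataclasses import dataclass

@dataclass
class Critique:
    accepted: bool
    feedback: str

def vlm_propose(observation: str, feedback: str | None) -> str:
    # Hypothetical stub: the real actor is a VLM that writes articulation
    # code (links, joint types, axes, limits) from text/image/video input,
    # conditioned on any critic feedback from the previous round.
    return "<joint name='door_hinge' type='revolute'/>"

def compile_to_twin(code: str) -> str:
    # Hypothetical stub: compile generated code into a simulator-ready
    # asset (e.g., URDF) that standard 3D simulators can load.
    return f"<robot>{code}</robot>"

def vlm_critique(observation: str, twin: str) -> Critique:
    # Hypothetical stub: the real critic is a VLM that compares renders of
    # the compiled twin against the input and explains any mismatch.
    return Critique(accepted=True, feedback="")

def articulate(observation: str, max_iters: int = 5) -> str:
    """Propose-evaluate-refine until the critic accepts or iterations run out."""
    feedback: str | None = None
    twin = ""
    for _ in range(max_iters):
        code = vlm_propose(observation, feedback)   # actor proposes code
        twin = compile_to_twin(code)                # compile to digital twin
        critique = vlm_critique(observation, twin)  # critic evaluates result
        if critique.accepted:
            break
        feedback = critique.feedback                # self-correct next round
    return twin

print(articulate("a kitchen cabinet with one hinged door"))
```

The key design point is that the critic's feedback is fed back into the actor's next proposal, which is what lets the system self-correct articulation errors over a few iterations rather than committing to a single one-shot generation.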
Problem

Research questions and friction points this paper is trying to address.

Manual creation of articulated 3D objects is slow and expensive
Interactive 3D modeling requires extensive human effort and expertise
The scarcity of articulated assets limits robotics and AR/VR simulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automates articulation via vision-language models that generate compilable code
Combines mesh retrieval over existing 3D asset datasets with an actor-critic refinement loop (see the retrieval sketch after this list)
Generates simulator-ready 3D assets from text, image, and video inputs
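
As a companion sketch, here is one plausible reading of the mesh-retrieval step: embed the query and a library of candidate meshes in a shared space, then keep the nearest candidates as the initialization. Everything below is an assumption for illustration; the encoder and library are random stand-ins (the paper retrieves from existing 3D asset datasets such as PartNet-Mobility), and `embed` and `retrieve` are hypothetical names, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in mesh library: id -> unit embedding vector. A real system would
# precompute embeddings for parts from an existing 3D asset dataset.
library = {f"mesh_{i:03d}": rng.normal(size=512) for i in range(100)}
library = {k: v / np.linalg.norm(v) for k, v in library.items()}

def embed(query: str) -> np.ndarray:
    # Hypothetical stand-in for a real text/image encoder.
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k library meshes with highest cosine similarity to the query."""
    q = embed(query)
    scores = {mesh_id: float(q @ v) for mesh_id, v in library.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(retrieve("cabinet door panel"))
```

Retrieving real meshes rather than generating geometry from scratch lets the system focus the VLM's effort on articulation structure while reusing high-quality existing assets.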
👥 Authors
Long Le, University of Pennsylvania
Jason Xie, University of Pennsylvania
William Liang, University of Pennsylvania
Hung-Ju Wang, University of Pennsylvania
Yue Yang, University of Pennsylvania
Yecheng Jason Ma, University of Pennsylvania
Kyle Vedder, University of Pennsylvania
Arjun Krishna, University of Pennsylvania (reinforcement learning, robotics)
Dinesh Jayaraman, Assistant Professor, University of Pennsylvania (robot learning, computer vision, robotics, machine learning)
Eric Eaton, University of Pennsylvania (artificial intelligence, machine learning, continual learning, robotics, medicine)