Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model

📅 2024-10-03

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Manually building interactive 3D objects for AR/VR, animation, and robotics is costly and labor-intensive. This paper introduces Articulate-Anything, a framework in which vision-language models (VLMs) generate code that compiles into interactive, articulated 3D digital twins from text, images, or videos. The method combines VLM-based understanding of the input, retrieval of meshes from existing 3D asset datasets, and an actor-critic loop that iteratively proposes, evaluates, and refines articulation solutions, correcting its own errors along the way. On PartNet-Mobility, it raises the success rate from 8.7-11.6% for prior work to 75%. The authors also generate assets from in-the-wild videos, use them to train robotic policies for fine-grained manipulation in simulation, and then transfer those policies to a real robot.

📝 Abstract

Interactive 3D simulated objects are crucial in AR/VR, animations, and robotics, driving immersive experiences and advanced automation. However, creating these articulated objects requires extensive human effort and expertise, limiting their broader applications. To overcome this challenge, we present Articulate-Anything, a system that automates the articulation of diverse, complex objects from many input modalities, including text, images, and videos. Articulate-Anything leverages vision-language models (VLMs) to generate code that can be compiled into an interactable digital twin for use in standard 3D simulators. Our system exploits existing 3D asset datasets via a mesh retrieval mechanism, along with an actor-critic system that iteratively proposes, evaluates, and refines solutions for articulating the objects, self-correcting errors to achieve a robust outcome. Qualitative evaluations demonstrate Articulate-Anything's capability to articulate complex and even ambiguous object affordances by leveraging rich grounded inputs. In extensive quantitative experiments on the standard PartNet-Mobility dataset, Articulate-Anything substantially outperforms prior work, increasing the success rate from 8.7-11.6% to 75% and setting a new bar for state-of-the-art performance. We further showcase the utility of our system by generating 3D assets from in-the-wild video inputs, which are then used to train robotic policies for fine-grained manipulation tasks in simulation that go beyond basic pick and place. These policies are then transferred to a real robotic system.
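
The propose, evaluate, and refine cycle described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Proposal` and `Critique` types, the `refine` function, and the toy actor and critic are all hypothetical names introduced here, and the VLM calls are replaced by plain Python stubs so that the loop itself is easy to follow.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Proposal:
    """A candidate articulation (hypothetical, heavily simplified)."""
    joint_type: str                 # e.g. "revolute" or "prismatic"
    axis: Tuple[int, int, int]      # joint axis direction


@dataclass
class Critique:
    """Critic output: a score in [0, 1] plus feedback for the actor."""
    score: float
    feedback: str


Actor = Callable[[Optional[Proposal], Optional[Critique]], Proposal]
Critic = Callable[[Proposal], Critique]


def refine(actor: Actor, critic: Critic,
           threshold: float = 0.9, max_iters: int = 5):
    """Alternate actor and critic until the score passes the threshold."""
    best, best_critique = None, None
    proposal, critique = None, None
    for _ in range(max_iters):
        proposal = actor(proposal, critique)   # propose (or revise)
        critique = critic(proposal)            # evaluate
        if best_critique is None or critique.score > best_critique.score:
            best, best_critique = proposal, critique
        if critique.score >= threshold:
            break                              # good enough, stop refining
    return best, best_critique


# Toy stand-ins for the VLM actor and critic (assumptions for illustration).
TARGET = Proposal("revolute", (0, 0, 1))


def toy_critic(p: Proposal) -> Critique:
    score = 0.5 * (p.joint_type == TARGET.joint_type) + 0.5 * (p.axis == TARGET.axis)
    if p.joint_type != TARGET.joint_type:
        return Critique(score, "wrong joint type")
    if p.axis != TARGET.axis:
        return Critique(score, "wrong axis")
    return Critique(score, "ok")


def toy_actor(prev: Optional[Proposal], crit: Optional[Critique]) -> Proposal:
    if prev is None:
        return Proposal("prismatic", (1, 0, 0))    # deliberately poor first guess
    if crit.feedback == "wrong joint type":
        return Proposal("revolute", prev.axis)
    if crit.feedback == "wrong axis":
        return Proposal(prev.joint_type, (0, 0, 1))
    return prev


best, critique = refine(toy_actor, toy_critic)
```

In this toy run the actor starts from a wrong joint type and axis and uses the critic's feedback to repair one error per iteration. In the system described above, both roles would presumably be played by VLM calls that look at the rendered result rather than at a known target.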

Problem

Research questions and friction points this paper is trying to address.

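Creating articulated 3D objects requires extensive human effort and expertise

The cost of manual modeling limits broader use of interactive objects

AR/VR, animation, and robotics simulation depend on interactive, articulated assets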

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automates articulation via vision-language models

Utilizes mesh retrieval and actor-critic system

Generates 3D assets from diverse inputs
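
To make "code that compiles into a digital twin" concrete, here is a minimal sketch that emits a two-link URDF description with a single joint. It assumes URDF as the target format, which the text above does not specify. The `build_urdf` helper, the link names, the mesh filenames, and the fixed `effort` and `velocity` values are illustrative assumptions rather than anything taken from the paper.

```python
import xml.etree.ElementTree as ET


def _fmt(values) -> str:
    """Format a 3-vector as the space-separated string URDF expects."""
    return " ".join(str(v) for v in values)


def build_urdf(name, base_mesh, part_mesh, joint_type,
               origin_xyz, axis_xyz, lower, upper) -> str:
    """Return a URDF string: a base link, a moving part, and one joint."""
    robot = ET.Element("robot", name=name)

    # One link per mesh; in a retrieval-based pipeline these files would
    # come from an existing asset dataset (hypothetical filenames here).
    for link_name, mesh in (("base", base_mesh), ("part", part_mesh)):
        link = ET.SubElement(robot, "link", name=link_name)
        visual = ET.SubElement(link, "visual")
        geometry = ET.SubElement(visual, "geometry")
        ET.SubElement(geometry, "mesh", filename=mesh)

    # The joint carries the articulation: type, pivot position, axis, limits.
    joint = ET.SubElement(robot, "joint", name="base_to_part", type=joint_type)
    ET.SubElement(joint, "parent", link="base")
    ET.SubElement(joint, "child", link="part")
    ET.SubElement(joint, "origin", xyz=_fmt(origin_xyz), rpy="0 0 0")
    ET.SubElement(joint, "axis", xyz=_fmt(axis_xyz))
    ET.SubElement(joint, "limit", lower=str(lower), upper=str(upper),
                  effort="10", velocity="1")  # placeholder values

    return ET.tostring(robot, encoding="unicode")


# Example: a cabinet door hinged about the vertical axis.
urdf_text = build_urdf("cabinet", "body.obj", "door.obj", "revolute",
                       (0.3, 0.0, 0.0), (0, 0, 1), 0.0, 1.57)
```

Under this framing, the articulation problem reduces to choosing a handful of parameters (joint type, origin, axis, limits), which is what makes it amenable to code generation followed by automated checking.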

🔎 Similar Papers

Survey on Modeling of Human-made Articulated Objects