🤖 AI Summary
Existing open-source articulated-object datasets suffer from significant deficiencies in visual photorealism and physical fidelity, severely limiting their practical utility for robot learning. To address this, we introduce the first high-fidelity digital-twin dataset of articulated objects specifically designed for robot learning, covering representative indoor scenes while jointly ensuring visual realism, physical accuracy, and modular interactivity. Our approach embeds modular interaction behaviors and pixel-level functional-region (affordance) annotations directly within the assets; leverages USD for unified asset encapsulation; and integrates optical motion-capture validation, PBR-based high-resolution texturing, and fine-grained calibration of rigid-body dynamics parameters. Experiments demonstrate substantial improvements in Sim2Real transfer, with consistent gains across both imitation-learning and reinforcement-learning benchmarks. The complete dataset, including all assets, annotations, and a comprehensive production pipeline, is publicly released under an open-source license.
📝 Abstract
Robot learning increasingly relies on simulation to advance complex abilities such as dexterous manipulation and precise interaction, necessitating high-quality digital assets to bridge the sim-to-real gap. However, existing open-source articulated-object datasets for simulation are limited by insufficient visual realism and low physical fidelity, which hinder their utility for training models to master robotic tasks in the real world. To address these challenges, we introduce ArtVIP, a comprehensive open-source dataset comprising high-quality digital-twin articulated objects, accompanied by indoor-scene assets. Crafted by professional 3D modelers adhering to unified standards, ArtVIP ensures visual realism through precise geometric meshes and high-resolution textures, while physical fidelity is achieved via fine-tuned dynamics parameters. Moreover, the dataset pioneers embedded modular interaction behaviors within assets and pixel-level affordance annotations. Feature-map visualization and optical motion capture are employed to quantitatively demonstrate ArtVIP's visual and physical fidelity, and its applicability is validated across imitation-learning and reinforcement-learning experiments. Provided in USD format with detailed production guidelines, ArtVIP is fully open-source, benefiting the research community and advancing robot learning research. Our project is at https://x-humanoid-artvip.github.io/ .