URDF-Anything: Constructing Articulated Objects with 3D Multimodal Language Model

📅 2025-11-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two shortcomings of digital twin modeling for articulated objects: low modeling efficiency and weak coupling between geometry and motion parameters. We propose an end-to-end joint reconstruction method leveraging a 3D multimodal large language model. Our key innovations include: (i) a dedicated [SEG] token mechanism that enables point-cloud-feature-driven fine-grained part segmentation and consistent motion parameter prediction; and (ii) an autoregressive framework integrating point cloud and textual inputs to jointly optimize geometric and kinematic representations. Evaluated on both synthetic and real-world datasets, our method significantly outperforms prior approaches—achieving a 17% improvement in segmentation mIoU, a 29% reduction in motion parameter error, and a 50% increase in physical executability—while demonstrating strong cross-object generalization. The method directly supports robot simulation training and embodied AI world model construction.
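The [SEG] token mechanism described above can be pictured as a learned token embedding that is compared against per-point features to produce a part mask, in the spirit of segmentation-token designs like LISA. The sketch below is illustrative only; the function and variable names (`seg_token_mask`, `point_feats`, `seg_embed`) are assumptions, not the paper's actual API.

```python
import numpy as np

def seg_token_mask(point_feats, seg_embed, threshold=0.0):
    """Illustrative sketch (not the paper's implementation): derive a
    per-point part mask from the similarity between a learned [SEG]
    token embedding and per-point point-cloud features."""
    logits = point_feats @ seg_embed  # (N,) similarity score per point
    return logits > threshold        # boolean mask selecting the part

# Toy example: 1024 points with 256-dim features, random [SEG] embedding.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1024, 256))
token = rng.normal(size=256)
mask = seg_token_mask(feats, token)
print(mask.shape, mask.sum())  # mask covers some subset of the points
```

Because the same token embedding also conditions the kinematic-parameter prediction in the paper's framework, the segmentation and motion outputs stay tied to one shared representation.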

📝 Abstract
Constructing accurate digital twins of articulated objects is essential for robotic simulation training and embodied AI world model building, yet historically requires painstaking manual modeling or multi-stage pipelines. In this work, we propose URDF-Anything, an end-to-end automatic reconstruction framework based on a 3D multimodal large language model (MLLM). URDF-Anything utilizes an autoregressive prediction framework based on point-cloud and text multimodal input to jointly optimize geometric segmentation and kinematic parameter prediction. It implements a specialized [SEG] token mechanism that interacts directly with point cloud features, enabling fine-grained part-level segmentation while maintaining consistency with the kinematic parameter predictions. Experiments on both simulated and real-world datasets demonstrate that our method significantly outperforms existing approaches in geometric segmentation (17% mIoU improvement), kinematic parameter prediction (29% average error reduction), and physical executability (surpassing baselines by 50%). Notably, our method exhibits excellent generalization ability, performing well even on objects outside the training set. This work provides an efficient solution for constructing digital twins for robotic simulation, significantly enhancing sim-to-real transfer capability.
Problem

Research questions and friction points this paper is trying to address.

Automates articulated object reconstruction for robotics simulation
Improves geometric segmentation and kinematic parameter prediction accuracy
Enhances sim-to-real transfer with strong generalization capability
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end automatic reconstruction using 3D multimodal language model
Autoregressive prediction framework with point-cloud and text input
Specialized token mechanism for joint segmentation and kinematics optimization
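The end product of such a pipeline is a URDF file: part geometry plus joint type, axis, and origin. As a minimal sketch of what the predicted kinematic parameters ultimately populate, the snippet below assembles a one-joint URDF with Python's standard library. The function name, link names, and joint limits are illustrative assumptions, not output of the paper's system.

```python
import xml.etree.ElementTree as ET

def make_urdf(name, axis, origin, joint_type="revolute"):
    """Illustrative sketch: serialize one predicted joint (axis + origin)
    into a minimal URDF. Names and limit values are assumptions."""
    robot = ET.Element("robot", name=name)
    ET.SubElement(robot, "link", name="base")   # static part geometry
    ET.SubElement(robot, "link", name="part")   # movable part geometry
    joint = ET.SubElement(robot, "joint", name="part_joint", type=joint_type)
    ET.SubElement(joint, "parent", link="base")
    ET.SubElement(joint, "child", link="part")
    ET.SubElement(joint, "axis", xyz=" ".join(f"{v:g}" for v in axis))
    ET.SubElement(joint, "origin", xyz=" ".join(f"{v:g}" for v in origin))
    ET.SubElement(joint, "limit", lower="0", upper="1.57",
                  effort="10", velocity="1")
    return ET.tostring(robot, encoding="unicode")

# E.g. a cabinet door hinging about the z-axis at a predicted origin.
print(make_urdf("cabinet", axis=(0, 0, 1), origin=(0.1, 0.0, 0.4)))
```

A file in this format loads directly into simulators such as PyBullet or Isaac Gym, which is what makes the physical-executability metric in the paper testable.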
Authors
Zhe Li
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Xiang Bai
Huazhong University of Science and Technology (HUST)
Jieyu Zhang
University of Washington
Zhuangzhe Wu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Che Xu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Ying Li
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Chengkai Hou
Peking University
Shanghang Zhang
Peking University