🤖 AI Summary
This work addresses articulation-aware 3D object modeling from monocular video captured by a freely moving camera. We propose an end-to-end deep learning method, trained exclusively on synthetic data, that jointly predicts part segmentation and kinematic joint parameters, including rotation axes and motion ranges. Unlike conventional approaches that require multi-view setups, static cameras, or real-world ground-truth annotations, our method demonstrates for the first time that purely synthetic training generalizes strongly to real-world dynamic scenes. The network processes raw monocular video streams directly, without manual initialization or post-processing, recovering precise part segmentation and physically interpretable joint structure on real videos. Experiments confirm the method's computational efficiency and its potential for real-time inference. This establishes a lightweight, scalable, and annotation-free visual perception paradigm for robotic manipulation and digital twin construction.
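To make the prediction target concrete, the sketch below shows one plausible way to represent the outputs the summary describes: per-part segmentation masks plus, for each movable part, a revolute joint's rotation axis and motion range. All names and structures here are illustrative assumptions, not the authors' actual code or data format.

```python
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class JointPrediction:
    """Hypothetical revolute-joint parameters (assumed representation)."""
    axis_origin: np.ndarray        # a 3D point on the rotation axis
    axis_direction: np.ndarray     # unit direction vector of the axis
    motion_range: Tuple[float, float]  # (min_angle, max_angle) in radians

@dataclass
class ArticulationPrediction:
    """Hypothetical per-object output: masks plus one joint per movable part."""
    part_masks: np.ndarray         # (num_parts, H, W) boolean segmentation masks
    joints: List[JointPrediction]

def normalize_axis(direction: np.ndarray) -> np.ndarray:
    """Normalize a predicted axis direction to unit length."""
    return direction / np.linalg.norm(direction)

# Example: a two-part object with a single hinge (e.g., a laptop lid),
# allowed to rotate between 0 and 90 degrees.
pred = ArticulationPrediction(
    part_masks=np.zeros((2, 480, 640), dtype=bool),
    joints=[
        JointPrediction(
            axis_origin=np.zeros(3),
            axis_direction=normalize_axis(np.array([1.0, 0.0, 0.0])),
            motion_range=(0.0, np.pi / 2),
        )
    ],
)
print(len(pred.joints))  # 1
```

A structure like this is what makes the predictions "physically interpretable": each joint can be replayed in simulation or handed to a manipulation planner directly.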
📝 Abstract
Understanding articulated objects is a fundamental challenge in robotics and digital twin creation. Effectively modeling such objects requires recovering both part segmentation and the underlying joint parameters. Despite the importance of this task, previous work has largely relied on restrictive setups such as multi-view systems, object scanning, or static cameras. In this paper, we present the first data-driven approach that jointly predicts part segmentation and joint parameters from monocular video captured with a freely moving camera. Trained solely on synthetic data, our method generalizes well to real-world objects, offering a scalable and practical solution for articulated object understanding. Our approach operates directly on casually recorded video, making it suitable for real-time applications in dynamic environments. Project webpage: https://aartykov.github.io/sim2art/