ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions

📅 2026-03-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses 4D reconstruction of hand-articulated-object interactions from a single monocular RGB video. Existing hand-object interaction methods are largely limited to rigid objects, while articulated-object reconstruction methods typically rely on multi-view inputs or pre-scanned object models. To overcome these limitations, we propose ArtHOI, an optimization-based framework that integrates and refines priors from multiple foundation models. An Adaptive Sampling Refinement (ASR) step resolves the object's metric scale and pose ambiguities, and a multimodal large language model (MLLM) guided alignment step uses contact reasoning as constraints to improve both physical plausibility and geometric accuracy. We also contribute two new datasets, ArtHOI-RGBD and ArtHOI-Wild, for evaluation. Experiments on both real-world and synthetic data show robust reconstructions across diverse objects and interaction scenarios.

📝 Abstract

Existing hand-object interaction (HOI) methods are largely limited to rigid objects, while 4D reconstruction methods for articulated objects generally require pre-scanning the object or even multi-view videos. Reconstructing 4D human-articulated-object interactions from a single monocular RGB video remains an unexplored but significant challenge. Recent advances in foundation models present a new opportunity to address this highly ill-posed problem. To this end, we introduce ArtHOI, an optimization-based framework that integrates and refines priors from multiple foundation models. Our key contribution is a suite of novel methods designed to resolve the inherent inaccuracies and physical implausibility of these priors. In particular, we introduce an Adaptive Sampling Refinement (ASR) method that optimizes the object's metric scale and pose to ground its normalized mesh in world space. Furthermore, we propose a Multimodal Large Language Model (MLLM) guided hand-object alignment method, which uses contact reasoning information as constraints when optimizing the hand-object mesh composition. To facilitate a comprehensive evaluation, we also contribute two new datasets, ArtHOI-RGBD and ArtHOI-Wild. Extensive experiments validate the robustness and effectiveness of ArtHOI across diverse objects and interactions. Project: https://arthoi-reconstruction.github.io.
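
The abstract does not describe how ASR works internally, so the following is only a generic illustration of the underlying idea of grounding a normalized mesh at metric scale by sampling. It is a minimal sketch, assuming point correspondences between the normalized mesh and metric observations are already known and that only a scalar scale is unknown. The function names `alignment_error` and `refine_scale`, the search range, and the toy data are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float, float]


def alignment_error(scale: float,
                    normalized_pts: Sequence[Point],
                    metric_pts: Sequence[Point]) -> float:
    """Mean squared distance between scaled normalized points and metric points."""
    total = 0.0
    for (x, y, z), (u, v, w) in zip(normalized_pts, metric_pts):
        total += (scale * x - u) ** 2 + (scale * y - v) ** 2 + (scale * z - w) ** 2
    return total / len(normalized_pts)


def refine_scale(cost: Callable[[float], float],
                 lo: float, hi: float,
                 samples: int = 9, rounds: int = 6) -> float:
    """Coarse-to-fine search: sample the interval evenly, keep the best
    candidate, then shrink the interval to one grid step on either side."""
    best = lo
    for _ in range(rounds):
        step = (hi - lo) / (samples - 1)
        candidates = [lo + i * step for i in range(samples)]
        best = min(candidates, key=cost)
        # Assumption: the cost has a single minimum inside [lo, hi],
        # so the true optimum lies within one step of the best sample.
        lo, hi = max(best - step, 1e-9), best + step
    return best


# Toy data (hypothetical): a normalized shape observed at a true scale of 0.35.
normalized = [(0.5, 0.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 0.5), (-0.5, -0.5, 0.0)]
observed = [(0.35 * x, 0.35 * y, 0.35 * z) for x, y, z in normalized]

scale = refine_scale(lambda s: alignment_error(s, normalized, observed), 0.05, 2.0)
print(f"estimated scale: {scale:.4f}")
```

Each round narrows the search interval by a factor of four with the default settings, so a few rounds give a fine estimate without a gradient. A full pose search would sample rotation and translation as well, which this sketch deliberately leaves out.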

Problem

Research questions and friction points this paper is trying to address.

4D reconstruction

hand-object interaction

articulated objects

monocular video

foundation models

Innovation

Methods, ideas, or system contributions that make the work stand out.

monocular 4D reconstruction

hand-object interaction

foundation models

articulated objects

multimodal LLM
