🤖 AI Summary
This work addresses category-level 6D pose estimation for articulated objects. Existing methods regress poses in continuous space, where they struggle to navigate a large, complex search space and to incorporate kinematic constraints. The authors propose DICArt, a conditional discrete diffusion framework that recovers object poses through a learned reverse process that iteratively denoises a discrete pose representation. The approach combines generative priors with structural constraints in two ways: a flow decider that determines, for each token, whether it should be denoised or reset, and a hierarchical kinematic coupling strategy that estimates the pose of each rigid part in the order given by the object's kinematic structure. Experiments on both synthetic and real-world datasets are reported to show superior accuracy and robustness.
📝 Abstract
Articulated object pose estimation is a core task in embodied AI. Existing methods typically regress poses in a continuous space, but often struggle to 1) navigate a large, complex search space and 2) incorporate intrinsic kinematic constraints. In this work, we introduce DICArt (DIsCrete Diffusion for Articulation Pose Estimation), a novel framework that formulates pose estimation as a conditional discrete diffusion process. Instead of operating in a continuous domain, DICArt progressively denoises a noisy pose representation through a learned reverse diffusion procedure to recover the ground-truth (GT) pose. To improve modeling fidelity, we propose a flexible flow decider that dynamically determines whether each token should be denoised or reset, effectively balancing the real and noise distributions during diffusion. Additionally, we incorporate a hierarchical kinematic coupling strategy that estimates the pose of each rigid part hierarchically to respect the object's kinematic structure. We validate DICArt on both synthetic and real-world datasets. Experimental results demonstrate its superior performance and robustness. By integrating discrete generative modeling with structural priors, DICArt offers a new paradigm for reliable category-level 6D pose estimation in complex environments.
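To make the idea of denoising a discrete pose representation more concrete, the following is a minimal toy sketch, not the DICArt implementation. The abstract does not specify the tokenization, the noise process, or how the flow decider reaches its decision, so everything here is an assumption for illustration: pose values are quantized into uniform bins, noise is a uniform random bin, and the per-token denoise-or-reset choice is a simple confidence threshold. All names (`NUM_BINS`, `quantize`, `dequantize`, `corrupt`, `reverse_step`) are hypothetical, and no learned network is involved.

```python
import numpy as np

NUM_BINS = 64  # hypothetical number of bins per pose dimension


def quantize(pose, low, high, num_bins=NUM_BINS):
    """Map continuous pose values in [low, high] to integer bin indices."""
    scaled = (np.asarray(pose, dtype=float) - low) / (high - low)
    return np.clip((scaled * num_bins).astype(int), 0, num_bins - 1)


def dequantize(tokens, low, high, num_bins=NUM_BINS):
    """Map bin indices back to the centre of each bin."""
    return low + (np.asarray(tokens) + 0.5) * (high - low) / num_bins


def corrupt(tokens, noise_rate, rng, num_bins=NUM_BINS):
    """Assumed forward process: each token is replaced by a uniformly
    random bin with probability noise_rate."""
    replace = rng.random(tokens.shape) < noise_rate
    noise = rng.integers(0, num_bins, size=tokens.shape)
    return np.where(replace, noise, tokens)


def reverse_step(predicted, confidence, threshold, rng, num_bins=NUM_BINS):
    """Assumed reverse step with a per-token denoise-or-reset decision.

    Tokens whose confidence reaches the threshold take the predicted bin;
    all other tokens are reset to a fresh uniform sample.
    """
    accept = np.asarray(confidence) >= threshold
    noise = rng.integers(0, num_bins, size=np.shape(predicted))
    return np.where(accept, predicted, noise)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pose = np.array([0.1, -0.4, 0.7, 0.0, 0.25, -0.9])  # toy 6-value pose
    tokens = quantize(pose, -1.0, 1.0)
    noisy = corrupt(tokens, noise_rate=1.0, rng=rng)
    # Stand-in for a learned denoiser: an oracle that predicts the clean
    # tokens, with made-up confidences for half of them.
    confidence = np.array([0.9, 0.2, 0.8, 0.1, 0.95, 0.3])
    step = reverse_step(tokens, confidence, threshold=0.5, rng=rng)
    print("clean:", tokens)
    print("noisy:", noisy)
    print("after one reverse step:", step)
```

In a real system the predicted tokens and confidences would come from a network conditioned on the observation, and the reverse step would be repeated over a schedule. The sketch only shows the mechanics of a per-token keep-or-reset choice on quantized pose values.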