🤖 AI Summary
This work addresses the challenge of category-level, model-free 6D object pose estimation for unseen objects by proposing the first unified feed-forward framework that jointly handles both absolute (SA(3)) and relative (SE(3)) pose estimation. The approach factorizes the task within a single Transformer architecture that simultaneously predicts depth, point maps, camera intrinsics, and normalized object coordinate space (NOCS) representations. It introduces a contrastive learning strategy that yields semantic-label-free object-centric latent embeddings, which, combined with point maps, enable geometrically consistent cross-view reasoning. The method achieves state-of-the-art performance on multiple benchmarks (NOCS, HouseCat6D, Omni6DPose, and Toyota-Light) for both absolute and relative pose estimation tasks.
📝 Abstract
Learning model-free object pose estimation for unseen instances remains a fundamental challenge in 3D vision. Existing methods typically fall into two disjoint paradigms: category-level approaches predict absolute poses in a canonical space but rely on predefined taxonomies, while relative pose methods estimate cross-view transformations but cannot recover single-view absolute poses. In this work, we propose Object Pose Transformer (\ours{}), a unified feed-forward framework that bridges these paradigms through task factorization within a single model. \ours{} jointly predicts depth, point maps, camera parameters, and normalized object coordinates (NOCS) from RGB inputs, enabling both category-level absolute SA(3) pose and unseen-object relative SE(3) pose estimation. Our approach leverages contrastive object-centric latent embeddings for canonicalization without requiring semantic labels at inference time, and uses point maps as a camera-space representation to enable multi-view relative geometric reasoning. Through cross-frame feature interaction and shared object embeddings, our model exploits relative geometric consistency across views to improve absolute pose estimation, reducing ambiguity in single-view predictions. Furthermore, \ours{} is camera-agnostic, estimating camera intrinsics on the fly and supporting optional depth input for metric-scale recovery, while remaining fully functional in RGB-only settings. Extensive experiments on diverse benchmarks (NOCS, HouseCat6D, Omni6DPose, Toyota-Light) demonstrate state-of-the-art performance in both absolute and relative pose estimation tasks within a single unified architecture.
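The abstract does not spell out how predicted NOCS coordinates and camera-space point maps are turned into a scale-aware absolute pose, but the standard tool for this pairing is the Umeyama similarity alignment, which recovers scale, rotation, and translation between two corresponded point sets in closed form. The sketch below illustrates that generic technique in NumPy; the function name and the direct least-squares solve are illustrative assumptions, not the paper's implementation (which may use RANSAC or a learned solver on top).

```python
import numpy as np

def umeyama_similarity(src, dst):
    """Solve dst ~= s * R @ src_i + t in the least-squares sense (Umeyama, 1991).

    src: (N, 3) predicted NOCS coordinates (canonical object space)
    dst: (N, 3) corresponding camera-space points (e.g. from a point map)
    Returns scale s, rotation R (3x3, det(R)=+1), translation t (3,).
    """
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst
    # 3x3 cross-covariance between camera-space and canonical points
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1  # fix an improper (reflection) solution
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

Given per-pixel NOCS predictions and the corresponding point-map entries inside an object mask, one call to this solver yields a similarity (SA(3)-style) pose hypothesis for that object.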