SCOPE: Semantic Conditioning for Sim2Real Category-Level Object Pose Estimation in Robotics

📅 2025-09-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing methods for category-level 6D object pose estimation rely on discrete category labels, which limits generalization to objects from unknown categories in open-world settings. Method: SCOPE is a diffusion-based category-level pose estimation model that replaces discrete category labels with continuous semantic priors taken from DINOv2 features. It combines these features with photorealistic synthetic training data and a noise model for point normals to reduce the Sim2Real gap, and it injects the semantic priors via cross-attention so that canonicalized object coordinate systems can be learned across object instances, including instances outside the known categories. Contribution/Results: Among synthetically trained category-level methods, SCOPE achieves a relative improvement of 31.9% on the 5°5cm metric over the current state of the art. On two instance-level datasets, it generalizes to unseen objects from unknown categories, with grasp success rates of up to 100%.

📝 Abstract

Object manipulation requires accurate object pose estimation. In open environments, robots encounter unknown objects, which requires semantic understanding in order to generalize both to known categories and beyond. To resolve this challenge, we present SCOPE, a diffusion-based category-level object pose estimation model that eliminates the need for discrete category labels by leveraging DINOv2 features as continuous semantic priors. By combining these DINOv2 features with photorealistic training data and a noise model for point normals, we reduce the Sim2Real gap in category-level object pose estimation. Furthermore, injecting the continuous semantic priors via cross-attention enables SCOPE to learn canonicalized object coordinate systems across object instances beyond the distribution of known categories. SCOPE outperforms the current state of the art in synthetically trained category-level object pose estimation, achieving a relative improvement of 31.9% on the 5°5cm metric. Additional experiments on two instance-level datasets demonstrate generalization beyond known object categories, enabling grasping of unseen objects from unknown categories with a success rate of up to 100%. Code available: https://github.com/hoenigpeter/scope.
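The 5°5cm metric mentioned in the abstract counts a pose estimate as correct when its rotation error is within 5 degrees and its translation error is within 5 cm. The sketch below shows one common way to compute this check with NumPy. The function names, the metre-based translation units, and the inclusive thresholds are assumptions made for illustration; the paper's own evaluation code (for example, any special handling of symmetric objects) may differ.

```python
import numpy as np


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    cos_angle = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    cos_angle = np.clip(cos_angle, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))


def translation_error_cm(t_pred, t_gt):
    """Euclidean distance between translations given in metres (assumed), in cm."""
    diff = np.asarray(t_pred, dtype=float) - np.asarray(t_gt, dtype=float)
    return float(np.linalg.norm(diff) * 100.0)


def is_correct_5deg_5cm(R_pred, t_pred, R_gt, t_gt, max_deg=5.0, max_cm=5.0):
    """True if both the rotation and the translation error are within threshold."""
    return (rotation_error_deg(R_pred, R_gt) <= max_deg
            and translation_error_cm(t_pred, t_gt) <= max_cm)


def rot_z(angle_deg):
    """Hypothetical helper: rotation about the z-axis by angle_deg degrees."""
    a = np.radians(angle_deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])


# Toy poses: ground truth is the identity pose at the origin.
R_gt, t_gt = np.eye(3), np.zeros(3)
close_ok = is_correct_5deg_5cm(rot_z(3.0), [0.02, 0.0, 0.0], R_gt, t_gt)
too_rotated = is_correct_5deg_5cm(rot_z(10.0), [0.02, 0.0, 0.0], R_gt, t_gt)
```

Accuracy on a dataset is then the fraction of predictions for which this check passes.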

Problem

Research questions and friction points this paper is trying to address.

Estimating object poses for robotic manipulation in open environments

Bridging the Sim2Real gap using semantic priors without category labels

Generalizing pose estimation to unseen object categories

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses DINOv2 features as continuous semantic priors

Combines photorealistic synthetic training data with a noise model for point normals

Injects semantic priors via cross-attention mechanism
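To make the cross-attention idea concrete, here is a minimal single-head sketch in NumPy. It assumes that geometric tokens act as queries and semantic tokens (stand-ins for DINOv2 patch features) act as keys and values; the abstract does not specify this arrangement, and all names, dimensions, and random weights below are hypothetical rather than taken from the SCOPE code.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, W_q, W_k, W_v):
    """Single-head scaled dot-product cross-attention.

    queries: (N, d_query) tokens that receive information.
    context: (M, d_context) tokens that provide information.
    Returns the attended features (N, d_attn) and the weights (N, M).
    """
    Q = queries @ W_q
    K = context @ W_k
    V = context @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
n_points, n_patches = 8, 16           # hypothetical token counts
d_geom, d_sem, d_attn = 32, 384, 64   # assumed sizes; 384 stands in for a semantic feature width

geom_tokens = rng.normal(size=(n_points, d_geom))     # placeholder geometric features
semantic_tokens = rng.normal(size=(n_patches, d_sem)) # placeholder semantic features

# Random projections stand in for learned weights.
W_q = rng.normal(size=(d_geom, d_attn)) / np.sqrt(d_geom)
W_k = rng.normal(size=(d_sem, d_attn)) / np.sqrt(d_sem)
W_v = rng.normal(size=(d_sem, d_attn)) / np.sqrt(d_sem)

fused, weights = cross_attention(geom_tokens, semantic_tokens, W_q, W_k, W_v)
```

Because the semantic tokens enter only through keys and values, no discrete category label is needed: each geometric token draws on whichever semantic features it attends to most strongly.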

🔎 Similar Papers

OP-Align: Object-level and Part-level Alignment for Self-supervised Category-level Articulated Object Pose Estimation

2024-08-29 | European Conference on Computer Vision | Citations: 2
