Learning Fine-Grained Correspondence with Cross-Perspective Perception for Open-Vocabulary 6D Object Pose Estimation

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing open-vocabulary 6D object pose estimation methods rely on global matching, which is prone to interference in cluttered scenes, leading to ambiguous correspondences and degraded accuracy. To address this, this work proposes the FiCoP framework, which replaces global matching with spatially constrained patch-level correspondences. FiCoP introduces a patch-to-patch correlation matrix as a structural prior to restrict the matching scope and incorporates three key components: an object-centric disentanglement preprocessing step, a Cross-Perspective Global Perception (CPGP) module, and a Patch Correlation Predictor (PCP), collectively enabling fine-grained and noise-robust pose estimation. Evaluated on the REAL275 and Toyota-Light datasets, FiCoP achieves average recall improvements of 8.0% and 6.1%, respectively, significantly outperforming current state-of-the-art methods.

📝 Abstract
Open-vocabulary 6D object pose estimation empowers robots to manipulate arbitrary unseen objects guided solely by natural language. However, a critical limitation of existing approaches is their reliance on unconstrained global matching strategies. In open-world scenarios, matching anchor features against the entire query image space introduces excessive ambiguity, as target features are easily confused with background distractors. To resolve this, we propose Fine-grained Correspondence Pose Estimation (FiCoP), a framework that transitions from noise-prone global matching to spatially-constrained patch-level correspondence. Our core innovation lies in leveraging a patch-to-patch correlation matrix as a structural prior to narrow the matching scope, effectively filtering out irrelevant clutter to prevent it from degrading pose estimation. First, we introduce an object-centric disentanglement preprocessing to isolate the semantic target from environmental noise. Second, a Cross-Perspective Global Perception (CPGP) module is proposed to fuse dual-view features, establishing structural consensus through explicit context reasoning. Finally, we design a Patch Correlation Predictor (PCP) that generates a precise block-wise association map, acting as a spatial filter to enforce fine-grained, noise-resilient matching. Experiments on the REAL275 and Toyota-Light datasets demonstrate that FiCoP improves Average Recall by 8.0% and 6.1%, respectively, compared to the state-of-the-art method, highlighting its capability to deliver robust and generalized perception for robotic agents operating in complex, unconstrained open-world environments. The source code will be made publicly available at https://github.com/zjjqinyu/FiCoP.
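The abstract's central idea — using a patch-to-patch correlation matrix as a spatial filter over dense feature matching — can be illustrated with a minimal NumPy sketch. All shapes, variable names, and the random binary prior below are hypothetical stand-ins (FiCoP predicts the prior with its PCP module); this is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: Na anchor patches and Nq query patches,
# each described by a D-dimensional feature vector.
Na, Nq, D = 4, 6, 8
anchor = rng.normal(size=(Na, D))
query = rng.normal(size=(Nq, D))

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Unconstrained (global) matching: every anchor patch is scored
# against every query patch by cosine similarity.
sim = l2_normalize(anchor) @ l2_normalize(query).T        # shape (Na, Nq)

# A patch-to-patch correlation matrix acts as a structural prior:
# zero entries mark anchor/query patch pairs that should never match.
# Here it is faked with a random binary mask for illustration.
prior = (rng.random((Na, Nq)) > 0.5).astype(float)
prior[:, 0] = 1.0  # ensure each anchor keeps at least one candidate

# Spatially constrained matching: suppress similarities outside the
# prior's support, then take the best surviving query patch per anchor.
masked = np.where(prior > 0, sim, -np.inf)
matches = masked.argmax(axis=1)
```

The point of the sketch is the `np.where` step: background distractors with high raw similarity are excluded from the argmax entirely, rather than merely down-weighted, which is what "acting as a spatial filter" means in the abstract.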
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary
6D object pose estimation
global matching
background distractors
spatial ambiguity
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained correspondence
cross-perspective perception
open-vocabulary pose estimation
patch-level matching
spatial filtering
Yu Qin
Peking University
Additive Manufacturing, Bone Implant
Shimeng Fan
School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan 411201, China
Fan Yang
Hunan University
Machine Learning
Zixuan Xue
School of Artificial Intelligence and Robotics, Hunan University, Changsha 410012, China
Zijie Mai
School of Artificial Intelligence and Robotics, Hunan University, Changsha 410012, China
Wenrui Chen
Hunan University
Robotics, Hands, Grasping, Dexterous Manipulation, Human-Robot Collaboration
Kailun Yang
Professor. School of Artificial Intelligence and Robotics, Hunan University (HNU); KIT; UAH; ZJU
Computer Vision, Computational Optics, Intelligent Vehicles, Autonomous Driving, Robotics
Zhiyong Li
Professor of Computer Science, Hunan University
Computer Vision, Object Detection