Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction

📅 2025-01-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing panoptic 3D reconstruction methods either optimize instance queries and the scene's embedding field separately or overlook spatial proximity among objects, which makes it hard to obtain accurate geometry, consistent semantics, and complete instances at the same time. PanopticRecon++ is an end-to-end open-vocabulary panoptic reconstruction framework that addresses this. It uses learnable 3D Gaussians as instance queries, which injects 3D spatial priors into the cross-attention between the queries and the scene's 3D embedding field, and it aligns 2D instance IDs across frames through optimal linear assignment against instance masks rendered from the queries. A panoptic head fuses query-based instance probabilities with semantic probabilities under a panoptic loss, and the number of query tokens adapts during training to match the number of objects. Experiments on simulation and real-world datasets show competitive 3D and 2D segmentation and reconstruction performance, and the method is demonstrated in a robot-simulator use case.
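The cross-frame alignment step can be illustrated with a small sketch. It assumes that masks are sets of pixel coordinates, that the matching score is mask IoU, and that there are no more detections than queries; none of these details are confirmed by the summary, and the function names are hypothetical. The search is exhaustive for readability, whereas a practical version would use a Hungarian solver such as SciPy's `linear_sum_assignment`.

```python
from itertools import permutations


def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def align_instance_ids(rendered_masks, detected_masks):
    """Match each frame-local 2D detection to a global instance query.

    rendered_masks: list of K masks rendered from the instance queries
    detected_masks: list of M <= K masks from a 2D segmenter
    returns: dict mapping frame-local index -> global query index

    Assumption: IoU is the matching score. The search is exhaustive,
    so it is exact but only viable for very small K.
    """
    k, m = len(rendered_masks), len(detected_masks)
    if m > k:
        raise ValueError("this sketch assumes no more detections than queries")
    iou = [[mask_iou(d, r) for r in rendered_masks] for d in detected_masks]
    # Pick the one-to-one assignment with the largest total IoU.
    best = max(
        permutations(range(k), m),
        key=lambda perm: sum(iou[i][q] for i, q in enumerate(perm)),
    )
    return dict(enumerate(best))


# Toy frame: three rendered query masks, two detections with frame-local IDs.
rendered = [{(0, 0), (0, 1)}, {(5, 5), (5, 6)}, {(9, 9)}]
detected = [{(5, 5), (5, 6), (5, 7)}, {(0, 0)}]
mapping = align_instance_ids(rendered, detected)
```

Here detection 0 overlaps only the second rendered mask and detection 1 only the first, so the frame-local IDs are remapped to global queries 1 and 0 respectively.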

📝 Abstract

Open-vocabulary panoptic reconstruction offers comprehensive scene understanding, enabling advances in embodied robotics and photorealistic simulation. In this paper, we propose PanopticRecon++, an end-to-end method that formulates panoptic reconstruction through a novel cross-attention perspective. This perspective models the relationship between 3D instances (as queries) and the scene's 3D embedding field (as keys) through their attention map. Unlike existing methods that separate the optimization of queries and keys or overlook spatial proximity, PanopticRecon++ introduces learnable 3D Gaussians as instance queries. This formulation injects 3D spatial priors to preserve proximity while maintaining end-to-end optimizability. Moreover, this query formulation facilitates the alignment of 2D open-vocabulary instance IDs across frames by leveraging optimal linear assignment with instance masks rendered from the queries. Additionally, we ensure semantic-instance segmentation consistency by fusing query-based instance segmentation probabilities with semantic probabilities in a novel panoptic head supervised by a panoptic loss. During training, the number of instance query tokens dynamically adapts to match the number of objects. PanopticRecon++ shows competitive 3D and 2D segmentation and reconstruction performance on both simulation and real-world datasets, and demonstrates a use case as a robot simulator. Our project website is at: https://yuxuan1206.github.io/panopticrecon_pp/
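To make the cross-attention view concrete, here is a minimal sketch of an attention map between instance queries and sampled points of an embedding field. It assumes isotropic Gaussians and assumes that the spatial prior enters as a log-Gaussian term added to a scaled dot-product logit; the abstract does not specify either choice, and the function and field names are hypothetical.

```python
import math


def gaussian_query_attention(points, point_feats, queries):
    """Soft-assign scene points (keys) to instance queries.

    points:      list of (x, y, z) positions sampled from the embedding field
    point_feats: list of feature vectors, one per point
    queries:     list of dicts with "mean" (x, y, z), "sigma" (isotropic
                 standard deviation, an assumption) and "feat" (feature vector)
    returns:     one row per point; each row is a distribution over queries
    """
    attention = []
    for pos, feat in zip(points, point_feats):
        logits = []
        for q in queries:
            # Feature term: scaled dot product between key and query embeddings.
            dot = sum(a * b for a, b in zip(feat, q["feat"])) / math.sqrt(len(feat))
            # Spatial term: log of an unnormalized isotropic Gaussian.
            sq_dist = sum((p - m) ** 2 for p, m in zip(pos, q["mean"]))
            logits.append(dot - 0.5 * sq_dist / q["sigma"] ** 2)
        # Numerically stable softmax over queries.
        peak = max(logits)
        weights = [math.exp(value - peak) for value in logits]
        total = sum(weights)
        attention.append([w / total for w in weights])
    return attention


# Two points with identical features: only the spatial prior separates them.
points = [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
point_feats = [[1.0, 0.0], [1.0, 0.0]]
queries = [
    {"mean": (0.0, 0.0, 0.0), "sigma": 1.0, "feat": [1.0, 0.0]},
    {"mean": (5.0, 5.0, 5.0), "sigma": 1.0, "feat": [1.0, 0.0]},
]
attn = gaussian_query_attention(points, point_feats, queries)
instance_ids = [row.index(max(row)) for row in attn]
```

Because the feature logits are equal in this toy case, each point is assigned to the query whose Gaussian center is nearest, which is the proximity-preserving behavior the spatial prior is meant to provide. Since every operation is differentiable in the means, scales, and features, the same computation in an autodiff framework would let all query parameters be optimized end to end.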

Problem

Research questions and friction points this paper is trying to address.

Panoptic Reconstruction

Object Recognition

Spatial Relationships

Innovation

Methods, ideas, or system contributions that make the work stand out.

PanopticRecon++

Cross Attention

3D Scene Reconstruction
