JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas

📅 2026-03-06

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work targets two obstacles to semantic segmentation across 3D point clouds and panoramic images: scarce annotated data and the limited generalization of fixed-label models. It proposes JOPP-3D, an open-vocabulary, language-driven segmentation framework that operates jointly on both modalities. The method converts RGB-D panoramic images into tangential perspective images and 3D point clouds, then extracts and aligns features from vision-language foundation models across these representations. Natural language queries can then produce semantic masks on both panoramas and point clouds. On the Stanford-2D-3D-s and ToF-360 datasets, the authors report significant improvements over prior state-of-the-art methods in both open- and closed-vocabulary 2D and 3D segmentation.
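To make the language-query step concrete, here is a minimal sketch of how open-vocabulary labelling is commonly done once per-point (or per-pixel) features and text embeddings live in a shared space. It is not the paper's implementation: the function name `query_mask` and the toy arrays are hypothetical, and in practice the features would come from a vision-language model rather than being written by hand.

```python
import numpy as np

def query_mask(features, text_embeddings, query_index):
    """Return a boolean mask of points whose best-matching prompt is `query_index`.

    features:        (N, D) array, one feature per point or pixel (assumed given).
    text_embeddings: (K, D) array, one embedding per text prompt (assumed given).
    """
    # L2-normalise so the dot product equals cosine similarity.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    similarity = f @ t.T                 # (N, K) cosine similarities
    labels = similarity.argmax(axis=1)   # best prompt for every point
    return labels == query_index

# Toy usage: three 2-D features, two prompts (placeholders for e.g. "chair", "wall").
features = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
prompts = np.array([[1.0, 0.0], [0.0, 1.0]])
mask = query_mask(features, prompts, query_index=0)
```

Because the prompt list is only an input array, changing the vocabulary means re-embedding text, not retraining a classifier head, which is the practical difference from a fixed-label model.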

📝 Abstract

Semantic segmentation across visual modalities such as 3D point clouds and panoramic images remains a challenging task, primarily due to the scarcity of annotated data and the limited adaptability of fixed-label models. In this paper, we present JOPP-3D, an open-vocabulary semantic segmentation framework that jointly leverages panoramic and point cloud data to enable language-driven scene understanding. We convert RGB-D panoramic images into their corresponding tangential perspective images and 3D point clouds, then use these modalities to extract and align foundational vision-language features. This allows natural language querying to generate semantic masks on both input modalities. Experimental evaluation on the Stanford-2D-3D-s and ToF-360 datasets demonstrates the capability of JOPP-3D to produce coherent and semantically meaningful segmentations across panoramic and 3D domains. Our proposed method achieves a significant improvement compared to the SOTA in open and closed vocabulary 2D and 3D semantic segmentation.
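The conversion from an RGB-D panorama to a 3D point cloud can be illustrated with a short sketch. It assumes an equirectangular panorama whose depth values are distances along each viewing ray, and an axis convention of x right, y up, z forward; neither assumption is stated in the abstract, and the function name `panorama_to_points` is hypothetical.

```python
import numpy as np

def panorama_to_points(depth):
    """Back-project an equirectangular depth map (H x W) to an (H*W, 3) point array.

    Assumes depth is measured along the viewing ray, in the camera frame.
    """
    h, w = depth.shape
    # Pixel centres mapped to longitude in (-pi, pi) and latitude in (-pi/2, pi/2).
    lon = ((np.arange(w) + 0.5) / w - 0.5) * 2.0 * np.pi
    lat = (0.5 - (np.arange(h) + 0.5) / h) * np.pi
    lon, lat = np.meshgrid(lon, lat)
    # Unit ray directions on the sphere (assumed convention: x right, y up, z forward).
    dirs = np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)
    # Scale each unit ray by its depth and flatten to a point list.
    return (dirs * depth[..., None]).reshape(-1, 3)

# Toy usage: a 4 x 8 panorama in which every ray hits a surface 2 m away.
points = panorama_to_points(np.full((4, 8), 2.0))
```

Since every pixel maps to exactly one point, row-major pixel index `i` corresponds to `points[i]`, which gives a simple way to carry 2D features or labels over to the point cloud.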

Problem

Research questions and friction points this paper is trying to address.

open-vocabulary semantic segmentation

3D point clouds

panoramic images

cross-modal understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

open-vocabulary segmentation

point cloud

panoramic image

vision-language alignment