JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two obstacles to semantic segmentation across 3D point clouds and panoramic images: scarce annotated data and the limited adaptability of fixed-label models. It proposes JOPP-3D, an open-vocabulary framework that uses both modalities jointly for language-driven scene understanding. The method converts RGB-D panoramic images into tangential perspective images and 3D point clouds, then extracts and aligns vision-language foundation-model features across them, so that a natural language query yields semantic masks on both the panorama and the point cloud. On the Stanford-2D-3D-s and ToF-360 datasets, the authors report a significant improvement over the state of the art in both open- and closed-vocabulary 2D and 3D semantic segmentation.
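
To make the panorama-to-point-cloud step concrete, here is a minimal sketch of back-projecting a depth panorama into 3D points. It is not taken from the paper: it assumes an equirectangular panorama, depth stored as range along each viewing ray, and an axis convention of x right, y up, z forward. The function name `panorama_to_points` is hypothetical.

```python
import numpy as np

def panorama_to_points(depth):
    """Back-project an equirectangular depth map of shape (H, W) to (H*W, 3) points.

    Assumption: each depth value is the range along the pixel's viewing ray.
    """
    h, w = depth.shape
    # Normalised pixel centres in [0, 1)
    u = (np.arange(w) + 0.5) / w
    v = (np.arange(h) + 0.5) / h
    # Longitude spans [-pi, pi), latitude spans (pi/2, -pi/2) from top to bottom
    lon = (u - 0.5) * 2.0 * np.pi
    lat = (0.5 - v) * np.pi
    lon, lat = np.meshgrid(lon, lat)  # both (H, W)
    # Unit ray directions (assumed convention: x right, y up, z forward)
    dirs = np.stack(
        [np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)],
        axis=-1,
    )
    # Scale each unit ray by its range to get 3D coordinates
    return (dirs * depth[..., None]).reshape(-1, 3)
```

Because every pixel maps to exactly one point, per-pixel labels or features on the panorama can be carried over to the point cloud by index, which is one simple way the two modalities can be kept aligned.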

📝 Abstract
Semantic segmentation across visual modalities such as 3D point clouds and panoramic images remains a challenging task, primarily due to the scarcity of annotated data and the limited adaptability of fixed-label models. In this paper, we present JOPP-3D, an open-vocabulary semantic segmentation framework that jointly leverages panoramic and point cloud data to enable language-driven scene understanding. We convert RGB-D panoramic images into their corresponding tangential perspective images and 3D point clouds, then use these modalities to extract and align foundational vision-language features. This allows natural language querying to generate semantic masks on both input modalities. Experimental evaluation on the Stanford-2D-3D-s and ToF-360 datasets demonstrates the capability of JOPP-3D to produce coherent and semantically meaningful segmentations across panoramic and 3D domains. Our proposed method achieves a significant improvement compared to the SOTA in open and closed vocabulary 2D and 3D semantic segmentation.
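
The language-driven querying described in the abstract can be illustrated with a generic open-vocabulary pattern: compare per-point (or per-pixel) features with text-prompt embeddings by cosine similarity and assign each point to the best match. This sketch is an assumption about the general technique, not the paper's implementation. It presumes the features and text embeddings already live in a shared vision-language space, and the name `query_masks` is hypothetical.

```python
import numpy as np

def query_masks(point_feats, text_feats):
    """Assign each point to its best-matching text prompt.

    point_feats: (N, D) features, assumed to come from a vision-language model.
    text_feats:  (K, D) embeddings of K text prompts in the same space.
    Returns labels of shape (N,) and boolean masks of shape (K, N).
    """
    # L2-normalise so that the dot product equals cosine similarity
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sim = p @ t.T                      # (N, K) similarity matrix
    labels = sim.argmax(axis=1)        # best prompt index per point
    # One boolean mask per prompt
    masks = labels[None, :] == np.arange(t.shape[0])[:, None]
    return labels, masks
```

In a closed-vocabulary setting the prompts would be the fixed dataset class names; in an open-vocabulary setting they can be arbitrary phrases supplied at query time.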
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary semantic segmentation
3D point clouds
panoramic images
cross-modal understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

open-vocabulary segmentation
point cloud
panoramic image
vision-language alignment
cross-modal semantic segmentation
Authors

Sandeep Inuganti
German Research Center for Artificial Intelligence, DFKI, Germany; RPTU Kaiserslautern, Germany
Hideaki Kanayama
Ricoh Company, Ltd., Japan
Kanta Shimizu
Ricoh Company, Ltd., Japan; Ricoh International B.V. - Niederlassung Deutschland, Germany
Mahdi Chamseddine
German Research Center for Artificial Intelligence, DFKI, Germany; RPTU Kaiserslautern, Germany
Soichiro Yokota
Ricoh Company, Ltd., Japan; Ricoh International B.V. - Niederlassung Deutschland, Germany
Didier Stricker
Professor for Computer Science, University Kaiserslautern
augmented reality, computer vision, image processing, body sensor networks, HCI
Jason Rambach
Team Leader Spatial Sensing and Machine Perception, DFKI GmbH
Computer Vision, Pattern Recognition, Machine Learning, Signal Processing