PLAF: Pixel-wise Language-Aligned Feature Extraction for Efficient 3D Scene Understanding

📅 2026-04-17
📈 Citations: 0 (influential: 0)

🤖 AI Summary
Existing representations struggle to combine pixel-level spatial precision, language-aligned open-vocabulary semantics, and scalability in 3D. Densely propagating 2D pixel-wise semantics into 3D also introduces substantial redundancy, which makes storage and querying inefficient in large scenes. This work proposes PLAF, a framework that pairs pixel-wise language-aligned feature extraction in 2D with an efficient semantic storage and querying scheme that reduces redundancy across both 2D and 3D domains. The authors report that PLAF provides a strong semantic foundation for accurate and efficient open-vocabulary 3D scene understanding. The code is publicly released.

📝 Abstract
Accurate open-vocabulary 3D scene understanding requires semantic representations that are both language-aligned and spatially precise at the pixel level, while remaining scalable when lifted to 3D space. However, existing representations struggle to jointly satisfy these requirements, and densely propagating pixel-wise semantics to 3D often results in substantial redundancy, leading to inefficient storage and querying in large-scale scenes. To address these challenges, we present PLAF, a Pixel-wise Language-Aligned Feature extraction framework that enables dense and accurate semantic alignment in 2D without sacrificing open-vocabulary expressiveness. Building upon this representation, we further design an efficient semantic storage and querying scheme that significantly reduces redundancy across both 2D and 3D domains. Experimental results show that PLAF provides a strong semantic foundation for accurate and efficient open-vocabulary 3D scene understanding. The code is publicly available at https://github.com/RockWenJJ/PLAF.
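
The abstract does not spell out how the storage and querying scheme works, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes that many pixels share an identical language-aligned feature vector, stores each distinct vector once in a codebook with an integer index per pixel, and answers a text query by scoring the codebook entries rather than every pixel. The function names (`compress_features`, `query`), the exact-duplicate deduplication, and the toy 3-dimensional embeddings are all hypothetical; a real system would obtain embeddings from a vision-language encoder.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def compress_features(feature_map):
    """Store each distinct feature vector once, plus one integer index per pixel.

    Assumption: redundancy shows up as exactly repeated vectors. This stands in
    for whatever compression the paper actually uses.
    """
    h, w, d = feature_map.shape
    flat = feature_map.reshape(-1, d)
    codebook, inverse = np.unique(flat, axis=0, return_inverse=True)
    return codebook, inverse.reshape(h, w)


def query(codebook, index_map, text_embedding):
    """Cosine similarity per codebook entry, broadcast back to a per-pixel map."""
    scores = l2_normalize(codebook) @ l2_normalize(text_embedding)
    return scores[index_map]


# Toy 4x4 feature map: left half carries one feature, right half another.
feat_a = np.array([1.0, 0.0, 0.0])
feat_b = np.array([0.0, 1.0, 0.0])
feature_map = np.zeros((4, 4, 3))
feature_map[:, :2] = feat_a
feature_map[:, 2:] = feat_b

codebook, index_map = compress_features(feature_map)

# Hypothetical text embedding that happens to align with feat_a.
text_embedding = np.array([1.0, 0.0, 0.0])
score_map = query(codebook, index_map, text_embedding)
mask = score_map > 0.5  # pixels considered a match for the query
```

In this toy case the 16 per-pixel vectors collapse to 2 codebook rows, and the query costs 2 dot products instead of 16. The same index-based layout could in principle be attached to 3D points instead of pixels, though how PLAF does this is not stated in the abstract.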
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary 3D scene understanding
language-aligned features
pixel-wise semantics
3D semantic redundancy
efficient semantic storage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pixel-wise
Language-Aligned
Open-Vocabulary
3D Scene Understanding
Semantic Redundancy Reduction