PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models

📅 2026-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D vision-language models suffer severe degradation of geometric information in their intermediate representations, caused by the scarcity of paired 3D-text data and reliance on language-token supervision alone. To address this, the paper proposes a lightweight, feature-level alignment regularization method that explicitly enforces consistency between intermediate point cloud tokens and the original visual input during language modeling via a consistency loss, thereby preserving fine-grained geometric semantics. The approach introduces, for the first time in 3D vision-language models, a feature-level alignment mechanism that trains only an alignment projector and LoRA adapters, keeping computational overhead minimal. Experiments show consistent gains: average classification accuracy improves by 2.08 percentage points on ModelNet40 and Objaverse, open-vocabulary classification by 7.50 points, and 3D captioning quality by 4.88 points.

📝 Abstract
The development of 3D Vision-Language Models (VLMs), crucial for applications in robotics, autonomous driving, and augmented reality, is severely constrained by the scarcity of paired 3D-text data. Existing methods rely solely on next-token prediction loss, using only language tokens for supervision. This results in inefficient utilization of limited 3D data and leads to a significant degradation and loss of valuable geometric information in intermediate representations. To address these limitations, we propose PointAlign, a novel feature-level alignment regularization method. PointAlign explicitly supervises intermediate point cloud tokens to preserve fine-grained 3D geometric-semantic information throughout the language modeling process. Specifically, we constrain the intermediate point cloud tokens within the LLM to align with visual input tokens via a consistency loss. By training only a lightweight alignment projector and LoRA adapters, PointAlign achieves explicit feature-level supervision with minimal computational overhead, effectively preventing geometric degradation. Extensive experiments on ModelNet40 and Objaverse datasets demonstrate that our method achieves a 2.08 pp improvement on average for classification tasks, with a substantial 7.50 pp gain on the challenging open-vocabulary Objaverse classification task and a 4.88 pp improvement on 3D object captioning evaluated by Qwen2-72B-Instruct, validating the effectiveness of PointAlign. Code is publicly available at https://github.com/yharoldsu0627/PointAlign.
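The abstract describes a consistency loss that aligns intermediate point cloud tokens inside the LLM with the visual input tokens through a lightweight alignment projector, but does not give the exact loss form here. Below is a minimal sketch of how such a regularizer could look, assuming a cosine-based consistency term and a linear projector; the names `W` and `cosine_consistency_loss` are illustrative, not taken from the paper.

```python
import numpy as np

def cosine_consistency_loss(intermediate_tokens, visual_tokens, W):
    """Assumed feature-level alignment term: penalize directional
    mismatch between projected intermediate tokens and the original
    visual (point cloud encoder) tokens."""
    # Lightweight alignment projector: map intermediate LLM-space
    # tokens back into the visual feature space.
    projected = intermediate_tokens @ W          # (n_tokens, d_vis)
    # L2-normalize both sides so the loss depends only on direction.
    p = projected / np.linalg.norm(projected, axis=1, keepdims=True)
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    # 1 - cosine similarity, averaged over tokens; 0 when perfectly
    # aligned, up to 2 when anti-aligned.
    return float(np.mean(1.0 - np.sum(p * v, axis=1)))
```

In training, a term like this would be added to the usual next-token prediction loss (e.g. `total = lm_loss + lam * consistency`), with gradients flowing only into the projector and LoRA adapters, matching the paper's claim of minimal trainable parameters.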
Problem

Research questions and friction points this paper is trying to address.

3D Vision-Language Models
paired 3D-text data scarcity
geometric information degradation
inefficient 3D data utilization
intermediate representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

feature-level alignment
3D vision-language models
point cloud token supervision
geometric-semantic preservation
lightweight regularization