Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses geometric hallucination in Point-Vision-Language Models (Point-VLMs), where predicted 3D structures contradict the observed 2D evidence. The authors attribute the failure to a misalignment in reinforcement learning: sparse geometric tokens are drowned out by sequence-level rewards that are broadcast to every token. To mitigate this, they propose Geometric Reward Credit Assignment, which decomposes holistic supervision into field-specific signals and routes each signal only to the token spans responsible for it. They also introduce a Reprojection-Consistency term that acts as a cross-modal verifier and penalizes physically impossible 3D predictions. On a calibrated benchmark derived from ShapeNetCore, the method raises 3D keypoint accuracy (KPA) from 0.64 to 0.93, increases 3D bounding box IoU to 0.686, and raises reprojection consistency to 0.852, while maintaining 2D localization performance.
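The difference between a broadcast sequence-level reward and span-routed, field-specific rewards can be illustrated with a minimal Python sketch. This is not the authors' implementation: the field names (`bbox_2d`, `keypoints_3d`), the half-open `(start, end)` span format, and the reward values are hypothetical, chosen only to show the routing idea described above.

```python
from typing import Dict, List, Tuple


def broadcast_credit(num_tokens: int, sequence_reward: float) -> List[float]:
    """Baseline: every token receives the same sequence-level reward."""
    return [sequence_reward] * num_tokens


def span_credit(
    num_tokens: int,
    field_spans: Dict[str, Tuple[int, int]],
    field_rewards: Dict[str, float],
    default: float = 0.0,
) -> List[float]:
    """Route each field's reward only to the tokens in that field's span.

    Spans are half-open [start, end) token index ranges (an assumption of
    this sketch). Tokens outside every span keep the default credit.
    """
    credit = [default] * num_tokens
    for field, (start, end) in field_spans.items():
        if not (0 <= start <= end <= num_tokens):
            raise ValueError(f"span for {field!r} is out of range")
        for t in range(start, end):
            credit[t] = field_rewards[field]
    return credit


# Hypothetical 10-token response with two structured fields.
spans = {"bbox_2d": (2, 4), "keypoints_3d": (5, 8)}
rewards = {"bbox_2d": 0.9, "keypoints_3d": 0.2}

per_token = span_credit(10, spans, rewards)
uniform = broadcast_credit(10, sum(rewards.values()) / len(rewards))
```

In the broadcast case the poorly predicted 3D keypoint tokens receive the same credit as the well-predicted 2D box tokens, whereas the span-routed vector gives the 3D tokens their own low reward. How these per-token credits enter the policy-gradient update is not specified in this summary.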

📝 Abstract

Point-Vision-Language Models promise to empower embodied agents with executable spatial reasoning, yet they frequently succumb to geometric hallucination where predicted 3D structures contradict the observed 2D reality. We identify a key cause of this failure not as a representation bottleneck but as a structural misalignment in reinforcement learning, where sparse geometric tokens are drowned out by noisy and broadcasted sequence-level rewards. To resolve this causal dilution, we propose Geometric Reward Credit Assignment, a framework that disentangles holistic supervision into field-specific signals and routes them exclusively to their responsible token spans. This mechanism transforms vague feedback into precise gradient updates and effectively turns generic policy optimization into targeted structural alignment. Furthermore, we internalize physical constraints via a Reprojection-Consistency term which serves as a cross-modal verifier to penalize physically impossible geometries. Validated on a calibrated benchmark derived from ShapeNetCore, our approach bridges the reliability gap by boosting 3D KPA from 0.64 to 0.93, increasing 3D bounding box intersection over union to 0.686, and raising reprojection consistency scores to 0.852. Crucially, these gains are achieved while maintaining robust 2D localization performance, marking a meaningful step from plausible textual outputs toward physically verifiable spatial predictions.
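One way to picture a reprojection-consistency check is to project predicted 3D keypoints into the image with a pinhole camera model and compare them with the predicted 2D keypoints. The sketch below assumes known intrinsics (`fx`, `fy`, `cx`, `cy`), camera-frame 3D points, and a pixel tolerance `tol_px`; the scoring rule (fraction of keypoints within tolerance) is a hypothetical stand-in, since the abstract does not define the paper's exact term.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project(point: Point3D, fx: float, fy: float, cx: float, cy: float) -> Point2D:
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (fx * x / z + cx, fy * y / z + cy)


def reprojection_consistency(
    points_3d: Sequence[Point3D],
    points_2d: Sequence[Point2D],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    tol_px: float = 5.0,
) -> float:
    """Fraction of 3D keypoints whose projection lands within tol_px pixels
    of the matching 2D keypoint. Points behind the camera count as misses."""
    if not points_3d or len(points_3d) != len(points_2d):
        raise ValueError("need equally sized, non-empty keypoint lists")
    hits = 0
    for p3, (u, v) in zip(points_3d, points_2d):
        if p3[2] <= 0:
            continue  # physically impossible for a visible keypoint
        pu, pv = project(p3, fx, fy, cx, cy)
        if math.hypot(pu - u, pv - v) <= tol_px:
            hits += 1
    return hits / len(points_3d)


# Hypothetical predictions: one consistent, one 20 px off, one behind the camera.
pts3d: List[Point3D] = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (0.0, 1.0, -1.0)]
pts2d: List[Point2D] = [(50.0, 50.0), (120.0, 50.0), (50.0, 50.0)]
score = reprojection_consistency(pts3d, pts2d, fx=100.0, fy=100.0, cx=50.0, cy=50.0)
```

A score of this kind can serve as a penalty signal because it depends only on agreement between the model's own 2D and 3D outputs under the camera geometry, not on additional labels.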

Problem

Research questions and friction points this paper is trying to address.

geometric hallucination

3D understanding

Point-VLMs

spatial reasoning

reinforcement learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometric Reward Credit Assignment

Point-Vision-Language Models

3D Understanding