🤖 AI Summary
Existing 3D point cloud encoders are sensitive to changes in sampling resolution and object scale, which limits their generalization. This work brings resolution decoupling, an ability already established in image encoders, to 3D point cloud learning and proposes Invaria, a compact encoder that learns scale- and density-invariant semantic features through next-resolution prediction and receptive field calibration. On ScanNet, Invaria achieves 56.0% higher mIoU when input resolution is reduced by a factor of three and a 20% improvement when object scale is reduced by a factor of three, with a 45% smaller model and 40% fewer input tokens on average.
📝 Abstract
Modern image encoders achieve high generalization by decoupling semantic meaning from resolution, an ability yet to be fully realized in the 3D domain. We investigate the failure of 3D point cloud encoders to achieve similar generalization and find that existing models are highly sensitive to changes in sampling resolution and scale, leading to significant performance degradation. This sensitivity is a major bottleneck for real-world deployment in robotics, as it suggests that models overfit to specific quantization densities and object scales rather than learning invariant semantic features. To mitigate this dependency, we propose Invaria, a point cloud encoder that achieves scale and density invariance through next-resolution prediction and receptive field calibration. While our objective is not the explicit generation of high-resolution point clouds, we find that this training objective encourages the model to learn robust structural invariants. The resulting encoder achieves significant performance gains under resolution shifts while maintaining high efficiency through a compact model size and reduced token requirements. Specifically, on ScanNet, Invaria achieves a 56.0% higher mIoU at 3× lower resolution and a 20% improvement when the object scale is reduced by a factor of 3. These gains are achieved with a 45% smaller model size and an average reduction of 40% in input tokens.
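To make the two distribution shifts discussed in the abstract concrete, here is a minimal sketch of how a test cloud might be perturbed. It is not the authors' evaluation protocol: the abstract does not say how resolution is lowered or how objects are rescaled, so voxel-grid downsampling, uniform scaling about the centroid, the voxel sizes, and the helper names `voxel_downsample` and `rescale` are all assumptions made for illustration.

```python
import numpy as np


def voxel_downsample(points, voxel_size):
    """Replace all points in each occupied voxel by their centroid.

    Assumption: sampling resolution is controlled by a voxel grid,
    a common choice for indoor scenes, not confirmed by the abstract.
    """
    # Integer voxel index of every point.
    keys = np.floor(points / voxel_size).astype(np.int64)
    # Group points that share a voxel.
    _, inverse, counts = np.unique(
        keys, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)  # guard against shape differences across NumPy versions
    sums = np.zeros((counts.shape[0], points.shape[1]))
    np.add.at(sums, inverse, points)  # accumulate coordinates per voxel
    return sums / counts[:, None]


def rescale(points, factor):
    """Uniformly scale a cloud about its centroid (hypothetical scale shift)."""
    centroid = points.mean(axis=0)
    return (points - centroid) * factor + centroid


# Synthetic stand-in for a scene: 20,000 points in a unit cube.
rng = np.random.default_rng(0)
cloud = rng.uniform(0.0, 1.0, size=(20000, 3))

base = voxel_downsample(cloud, voxel_size=0.02)    # illustrative "training" resolution
coarse = voxel_downsample(cloud, voxel_size=0.06)  # 3x coarser voxels
shrunk = rescale(cloud, 1.0 / 3.0)                 # extent reduced to one-third

print(base.shape[0], coarse.shape[0])
```

An encoder that is invariant in the sense the abstract describes should produce similar semantic predictions for `base`, `coarse`, and `shrunk`, even though the point count and the metric extent differ considerably between them.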