Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing 3D perception pretraining methods for robotic manipulation struggle to balance structural priors against representational capacity: implicit representations lack explicit structural cues, while explicit representations suffer from limited resolution and weak generalization. This work proposes a hybrid representation termed “structural latent points,” which embeds a point-wise variational autoencoder into the latent space of a point cloud autoencoder to jointly regularize point features and coordinates, thereby preserving coarse-grained structure and semantic information. Coupled with a lightweight 3D Gaussian Splatting (3DGS) rendering pipeline, the method concentrates representational power in the front-end latent module. Evaluated on RLBench, ManiSkill2, and a real-robot platform, the approach consistently improves task success, sample efficiency, and robustness to viewpoint and scene variations over strong baselines.

📝 Abstract

Current 3D-aware pretraining methods for embodied perception and manipulation are largely built on differentiable rendering frameworks, producing either fully implicit neural fields or fully explicit geometric primitives. Implicit representations, while expressive, lack explicit structural cues, whereas explicit ones preserve geometry but suffer from resolution limits and weak generalization. To address these limitations, we propose a novel pretraining framework that learns a hybrid representation: structural latent points. Specifically, we insert a point-wise latent variational autoencoder into the latent space of a point-cloud autoencoder, jointly regularizing point-wise features and coordinates toward a Gaussian prior. The resulting compact latent preserves coarse structural tendencies: it does not encode precise geometry, but it captures rough shape and richer semantic information, effectively combining the expressiveness of implicit representations with the structural priors of explicit ones. In addition, informed by shared design choices in prior work, we develop a streamlined 3DGS-based rendering pipeline that is deliberately kept lightweight, improving efficiency while leaving greater representational capacity to the front-end latent module. Extensive evaluations on RLBench, ManiSkill2, and a real-robot platform demonstrate consistent gains in task success, sample efficiency, and robustness to viewpoint and scene variations over strong baselines. Ablation studies further confirm that each component of our framework is critical to overall performance.
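
The joint regularization described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the architecture, so the linear heads `w_mu` and `w_logvar`, the function names, and the choice to treat the first three latent channels as coordinates are all assumptions made for illustration. The sketch shows only the generic VAE ingredients, namely a per-point Gaussian posterior over concatenated coordinates and features, the reparameterization trick, and the closed-form KL term toward a standard normal prior.

```python
import numpy as np

def pointwise_kl(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)) for each point, summed over channels."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def structural_latent_points(coords, feats, w_mu, w_logvar, rng):
    """Hypothetical point-wise VAE bottleneck (assumed design, not from the paper).

    coords: (N, 3) latent point coordinates from a point-cloud encoder.
    feats:  (N, C) latent point features from the same encoder.
    w_mu, w_logvar: (3 + C, 3 + D) linear heads standing in for learned layers.
    Returns sampled latent coordinates (N, 3), latent features (N, D),
    and the mean per-point KL term.
    """
    h = np.concatenate([coords, feats], axis=-1)  # regularize both jointly
    mu = h @ w_mu
    logvar = h @ w_logvar
    z = reparameterize(mu, logvar, rng)
    kl = pointwise_kl(mu, logvar).mean()
    return z[:, :3], z[:, 3:], kl

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coords = rng.standard_normal((16, 3))
    feats = rng.standard_normal((16, 8))
    w_mu = 0.1 * rng.standard_normal((11, 11))
    w_logvar = 0.1 * rng.standard_normal((11, 11))
    z_xyz, z_feat, kl = structural_latent_points(coords, feats, w_mu, w_logvar, rng)
    print(z_xyz.shape, z_feat.shape, float(kl))
```

In a full pipeline the KL term would typically be added, with some weight, to a reconstruction or rendering loss; how the paper balances these terms is not stated in the abstract.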

Problem

Research questions and friction points this paper is trying to address.

3D-aware pretraining

implicit representations

explicit representations

structural priors

visual representations

Innovation

Methods, ideas, or system contributions that make the work stand out.

structural latent points

hybrid 3D representation

variational autoencoder