🤖 AI Summary
To address the low retrieval efficiency in large-scale 3D point cloud scenes, this paper proposes the first differentiable search index (DSI) method tailored for point clouds. The approach maps point cloud descriptors end-to-end to compact 1D hash-like identifiers, enabling constant-time (O(1)) direct retrieval. It adapts the DSI framework—originally developed for text—to 3D point clouds by designing a Vision Transformer (ViT)-based encoder with joint positional and semantic encoding, which supports differentiable learning of identifiers and end-to-end optimization. Evaluated on a public benchmark, the method achieves competitive recall and localization accuracy while substantially accelerating retrieval, sidestepping the computational bottleneck of comparing a query against every reference descriptor as in conventional approximate nearest neighbor (ANN) search, and pointing toward real-time large-scale 3D scene recognition.
📝 Abstract
Retrieval in 3D point clouds is a challenging task that consists of retrieving, for a given query, the most similar point clouds from a reference set. Current methods identify similar point clouds by comparing their descriptors. Because this comparison step is computationally expensive, we focus on accelerating retrieval by adapting the Differentiable Search Index (DSI), a transformer-based approach originally designed for text information retrieval, to 3D point cloud retrieval. Our approach generates 1D identifiers from the point descriptors, enabling direct retrieval in constant time. To adapt DSI to 3D data, we integrate Vision Transformers that map descriptors to these identifiers while incorporating positional and semantic encoding. The approach is evaluated for place recognition on a public benchmark, comparing both the quality and the speed of its retrieval against state-of-the-art methods.
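To make the retrieval pattern concrete, the sketch below illustrates the DSI-style pipeline described above: a mapping from descriptors to compact 1D identifiers, followed by a constant-time lookup. This is a minimal illustration, not the paper's method: the actual identifier mapping is a trained ViT with positional and semantic encoding, whereas here a fixed random projection with argmax bucketing stands in as a hypothetical placeholder, purely to show the index/query structure.

```python
import numpy as np

rng = np.random.default_rng(0)
DESC_DIM, NUM_IDS = 32, 16  # illustrative sizes, not from the paper

# Hypothetical stand-in for the learned descriptor -> identifier mapping
# (the paper trains a ViT end-to-end for this step).
W = rng.standard_normal((DESC_DIM, NUM_IDS))

def predict_identifier(descriptor: np.ndarray) -> int:
    """Map a point-cloud descriptor to a compact 1D identifier."""
    return int(np.argmax(descriptor @ W))

# Indexing phase: assign each reference point cloud an identifier once.
reference_descs = rng.standard_normal((100, DESC_DIM))
index: dict[int, list[int]] = {}
for i, desc in enumerate(reference_descs):
    index.setdefault(predict_identifier(desc), []).append(i)

# Query phase: one forward pass plus one dictionary lookup gives
# constant-time candidate retrieval, instead of comparing the query
# against every reference descriptor as in ANN search.
query = reference_descs[42]
candidates = index.get(predict_identifier(query), [])
assert 42 in candidates
```

The key property shown is that, once identifiers are learned, query cost no longer grows with the size of the reference set; only the identifier prediction and a hash-table lookup remain.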