AI Summary
This work addresses a challenge in robot vision: designing 2D/3D positional encodings that are both strictly translation-invariant and computationally efficient. We propose STRING, a separable, provably translation-invariant positional encoding for high-dimensional coordinates. STRING generalizes Rotary Position Embedding (RoPE) to arbitrary dimensions via separable tensor decomposition and group-invariant functions, guaranteeing exact translation invariance with negligible computational overhead. It is the first framework to unify translation-invariant positional embeddings across arbitrary coordinate dimensions, and it enables end-to-end differentiable 3D token generation from RGB-D inputs. Evaluated on open-vocabulary object detection and robot-control tasks, STRING consistently improves performance, demonstrating its effectiveness for 3D perception and embodied intelligence.
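To make the translation-invariance property concrete, here is a minimal sketch of classic 1D RoPE (the method STRING generalizes), not the paper's implementation: feature pairs are rotated by position-dependent angles, so the query-key score depends only on the positional offset. Function names and frequency choices here are illustrative assumptions.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply rotary position encoding to a 1D feature vector x at scalar position pos."""
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE rotates features in pairs"
    freqs = base ** (-np.arange(d // 2) / (d // 2))  # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin  # 2D rotation applied to each feature pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# The score at positions (3, 5) equals the score at shifted positions (10, 12):
s1 = rope_rotate(q, 3.0) @ rope_rotate(k, 5.0)
s2 = rope_rotate(q, 10.0) @ rope_rotate(k, 12.0)
assert np.isclose(s1, s2)  # the score depends only on the offset 5 - 3 = 2
```

The invariance follows from the rotation identity ⟨R(a)q, R(b)k⟩ = ⟨q, R(b−a)k⟩: shifting both positions by the same amount leaves every pairwise angle difference, and hence the attention score, unchanged.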
Abstract
We introduce STRING: Separable Translationally Invariant Position Encodings. STRING extends Rotary Position Encodings, a recently proposed and widely used algorithm in large language models, via a unifying theoretical framework. Importantly, STRING still provides exact translation invariance, even for token coordinates of arbitrary dimensionality, whilst maintaining a low computational footprint. These properties are especially important in robotics, where efficient 3D token representation is key. We integrate STRING into Vision Transformers with RGB(-D) inputs (color plus optional depth), showing substantial gains, e.g. in open-vocabulary object detection and for robotics controllers. We complement our experiments with a rigorous mathematical analysis, proving the universality of our methods.
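The abstract emphasizes separability for higher-dimensional token coordinates. One standard separable baseline, sketched here for intuition only (it is not STRING's more general construction), splits the feature dimension across coordinate axes and applies a 1D rotary rotation per axis; the resulting scores depend only on the 2D offset between tokens. All names here are illustrative.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Classic 1D rotary encoding: rotate feature pairs by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(d // 2) / (d // 2))
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, coord):
    """Separable (axial) 2D extension: half the features per coordinate axis."""
    d = x.shape[-1] // 2
    return np.concatenate([rope_rotate(x[:d], coord[0]),
                           rope_rotate(x[d:], coord[1])])

rng = np.random.default_rng(1)
q, k = rng.normal(size=16), rng.normal(size=16)
p1, p2 = np.array([2.0, -1.0]), np.array([5.0, 4.0])
shift = np.array([3.0, 7.0])

# Translating both tokens by the same 2D shift leaves the score unchanged:
s1 = rope_2d(q, p1) @ rope_2d(k, p2)
s2 = rope_2d(q, p1 + shift) @ rope_2d(k, p2 + shift)
assert np.isclose(s1, s2)
```

Each axis contributes an independent block of rotations, so the construction extends to any coordinate dimensionality by further splitting the feature dimension, which is the setting the paper's analysis addresses in full generality.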