Learning the RoPEs: Better 2D and 3D Position Encodings with STRING

📅 2025-02-04
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a challenge in robot vision: designing 2D/3D positional encodings that are exactly translation invariant while remaining computationally efficient. The authors propose STRING, a separable, provably translation-invariant high-dimensional positional encoding method. STRING generalizes Rotary Position Embedding (RoPE) to coordinates of arbitrary dimensionality via separable tensor decomposition and group-invariant functions, guaranteeing exact translation invariance at negligible computational overhead. It unifies translation-invariant positional embeddings across arbitrary dimensions in a single framework and enables end-to-end differentiable generation of 3D tokens from RGB-D input. Evaluated on open-vocabulary object detection and robot control tasks, STRING consistently improves performance, demonstrating its effectiveness for 3D perception and embodied intelligence.
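To make the RoPE baseline that STRING generalizes concrete, here is a minimal axis-separable 2D sketch (our own illustration, not the paper's STRING construction; all function names are ours): half the feature pairs are rotated by angles proportional to the x coordinate, the other half by the y coordinate, so query-key dot products depend only on coordinate differences.

```python
import numpy as np

def rope_2d_angles(coords, dim, base=10000.0):
    """Per-pair rotation angles for 2D positions; dim must be divisible by 4.

    Half of the rotary pairs encode the x axis, half the y axis (separable).
    coords: (n, 2) token positions. Returns (n, dim // 2) angles.
    """
    pairs_per_axis = dim // 4
    freqs = base ** (-np.arange(pairs_per_axis) / pairs_per_axis)  # (p,)
    ang_x = coords[:, :1] * freqs  # (n, p), linear in x
    ang_y = coords[:, 1:] * freqs  # (n, p), linear in y
    return np.concatenate([ang_x, ang_y], axis=-1)

def rope_2d(vecs, coords):
    """Rotate each consecutive feature pair by its position-dependent angle."""
    n, dim = vecs.shape
    ang = rope_2d_angles(coords, dim)        # (n, dim // 2)
    cos, sin = np.cos(ang), np.sin(ang)
    v = vecs.reshape(n, dim // 2, 2)
    out = np.empty_like(v)
    out[..., 0] = v[..., 0] * cos - v[..., 1] * sin
    out[..., 1] = v[..., 0] * sin + v[..., 1] * cos
    return out.reshape(n, dim)
```

Because the angles are linear in position, translating both query and key by the same 2D offset leaves the attention score unchanged, which is the exact translation invariance the summary refers to.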

📝 Abstract
We introduce STRING: Separable Translationally Invariant Position Encodings. STRING extends Rotary Position Encodings, a recently proposed and widely used algorithm in large language models, via a unifying theoretical framework. Importantly, STRING still provides exact translation invariance, including token coordinates of arbitrary dimensionality, whilst maintaining a low computational footprint. These properties are especially important in robotics, where efficient 3D token representation is key. We integrate STRING into Vision Transformers with RGB(-D) inputs (color plus optional depth), showing substantial gains, e.g. in open-vocabulary object detection and for robotics controllers. We complement our experiments with a rigorous mathematical analysis, proving the universality of our methods.
Problem

Research questions and friction points this paper is trying to address.

Designing 2D/3D position encodings with exact translation invariance
Keeping the computational footprint low enough for robotics
Improving Vision Transformers operating on RGB(-D) inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

STRING generalizes Rotary Position Encodings to arbitrary coordinate dimensions
Preserves exact translation invariance with low computational overhead
Integrates into Vision Transformers with consistent gains
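The combination claimed above (strictly more general than RoPE, yet still exactly translation invariant) can be sketched as follows. This is our own illustration under the paper's stated properties, not the authors' exact parameterization: any set of commuting orthogonal generators built from blockwise rotations in a shared learned basis `Q` yields `R(p).T @ R(q) = R(q - p)`, so attention scores depend only on relative position.

```python
import numpy as np

def make_string_like(dim, n_axes, rng):
    """Build a hypothetical STRING-style encoder (illustrative, not the paper's).

    A single shared orthogonal basis Q makes the per-axis generators commute,
    so R(p) = Q @ block_rot(p @ thetas) @ Q.T satisfies
    R(p).T @ R(q) = R(q - p): exact translation invariance.
    dim must be even; n_axes is the coordinate dimensionality (e.g. 3 for 3D).
    """
    q_mat, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # shared basis
    thetas = rng.normal(size=(n_axes, dim // 2))  # per-axis rotation rates
    def encode(pos):
        ang = pos @ thetas                        # angles, linear in position
        c, s = np.cos(ang), np.sin(ang)
        rot = np.zeros((dim, dim))
        for i in range(dim // 2):                 # block-diagonal 2x2 rotations
            rot[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c[i], -s[i]],
                                                     [s[i], c[i]]]
        return q_mat @ rot @ q_mat.T
    return encode
```

Plain RoPE is the special case where `Q` is the identity; the shared-basis construction shows how the family can be widened (and made learnable) without giving up exact invariance.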
👥 Authors
Connor Schenck
Google DeepMind
Artificial Intelligence, Machine Learning, Robotics, Developmental Robotics
Isaac Reid
PhD student, University of Cambridge
Machine learning, Inference, Statistical physics
Mithun Jacob
Google DeepMind Robotics
robotics, mapping, localization, SLAM, computer vision
Alex Bewley
Google DeepMind
Robotics, Machine Learning, Computer Vision, Vision Language Models
Joshua Ainslie
Google LLC
Machine Learning
David Rendleman
Google DeepMind
Deepali Jain
Google DeepMind
Artificial Intelligence, Robotics, Reinforcement Learning
Mohit Sharma
Google DeepMind
Kumar Avinava Dubey
Google Research
Efficient Transformers, LLMs & VLMs, Scalable ML, Statistical ML
Ayzaan Wahid
Google DeepMind
Sumeet Singh
Google DeepMind
Rene Wagner
Google DeepMind
Tianli Ding
Google DeepMind
Chuyuan Fu
Google DeepMind
Robotics, Simulation, Computer Graphics, Solid and Fluid Mechanics
Arunkumar Byravan
Google DeepMind
Jake Varley
Google DeepMind
Alexey Gritsenko
Google DeepMind
Matthias Minderer
Member of Technical Staff, Microsoft AI
Representation learning, Unsupervised learning, Object detection, Vision-language models
Dmitry Kalashnikov
Google
Robotics, Machine Learning, Reinforcement Learning
Jonathan Tompson
Meta Reality Labs
Computer Science
Vikas Sindhwani
Google DeepMind Robotics
AI, Robotics, AI Safety, Machine Learning, Optimization
Krzysztof Choromanski
Google DeepMind Robotics & Columbia University
robotics, reinforcement learning, efficient Transformers, quasi Monte Carlo methods