Efficient Multi-Camera Tokenization with Triplanes for End-to-End Driving

📅 2025-06-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address inefficient real-time tokenization and weak geometric modeling of multi-camera sensor data in end-to-end autonomous driving, this paper proposes a geometry-aware multi-view representation based on triplane encoding, the first application of triplane representations from 3D neural reconstruction to perception tokenization for autonomous driving. The method integrates neural radiance fields (NeRF), multi-camera geometric calibration constraints, differentiable voxel sampling, and a lightweight feature projection network, yielding a compact visual representation that is agnostic to the number of cameras and their resolution and that explicitly encodes 3D geometry. Evaluated on a large-scale real-world autonomous vehicle dataset and a neural simulator, the approach produces up to 72% fewer tokens than image patch-based tokenization, speeds up policy inference by up to 50%, achieves the same open-loop motion planning accuracy, and improves offroad rates in closed-loop driving simulations.

📝 Abstract

Autoregressive Transformers are increasingly being deployed as end-to-end robot and autonomous vehicle (AV) policy architectures, owing to their scalability and potential to leverage internet-scale pretraining for generalization. Accordingly, tokenizing sensor data efficiently is paramount to ensuring the real-time feasibility of such architectures on embedded hardware. To this end, we present an efficient triplane-based multi-camera tokenization strategy that leverages recent advances in 3D neural reconstruction and rendering to produce sensor tokens that are agnostic to the number of input cameras and their resolution, while explicitly accounting for their geometry around an AV. Experiments on a large-scale AV dataset and state-of-the-art neural simulator demonstrate that our approach yields significant savings over current image patch-based tokenization strategies, producing up to 72% fewer tokens, resulting in up to 50% faster policy inference while achieving the same open-loop motion planning accuracy and improved offroad rates in closed-loop driving simulations.
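The scaling argument in the abstract can be made concrete with a small token-count sketch. It is illustrative only: the function names, camera resolution, patch sizes, and plane shapes below are hypothetical and are not taken from the paper. It shows that a patch-based token count grows with the number of cameras and their resolution, while a triplane token count depends only on the sizes of the three feature planes.

```python
def patch_token_count(num_cameras, height, width, patch_size):
    """Tokens from cutting every camera image into non-overlapping square patches."""
    return num_cameras * (height // patch_size) * (width // patch_size)


def triplane_token_count(plane_shapes, patch_size):
    """Tokens from patchifying three axis-aligned feature planes.

    plane_shapes is a list of three (rows, cols) pairs. It is fixed by the
    tokenizer design, not by how many cameras feed into the planes.
    """
    return sum((rows // patch_size) * (cols // patch_size) for rows, cols in plane_shapes)


if __name__ == "__main__":
    # Hypothetical settings, chosen only to illustrate the scaling behavior.
    planes = [(96, 96), (96, 32), (96, 32)]
    for cams in (4, 8):
        per_image = patch_token_count(cams, height=320, width=512, patch_size=16)
        triplane = triplane_token_count(planes, patch_size=8)
        print(f"{cams} cameras: {per_image} patch tokens vs {triplane} triplane tokens")
```

Doubling the camera count doubles the patch-based total, whereas the triplane total stays constant. The reported figure of up to 72% fewer tokens comes from the paper's own configurations, which this sketch does not attempt to reproduce.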

Problem

Research questions and friction points this paper is trying to address.

Efficient tokenization of multi-camera sensor data

Reducing token count for faster policy inference

Maintaining accuracy in autonomous vehicle motion planning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Triplane-based multi-camera tokenization strategy

Geometry-aware sensor tokens for AVs

Up to 72% fewer tokens and up to 50% faster policy inference

🔎 Similar Papers

MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving