Human Insights Driven Latent Space for Different Driving Perspectives: A Unified Encoder for Efficient Multi-Task Inference

📅 2024-09-16
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Autonomous driving vision models often suffer from poor generalization in complex urban scenarios due to insufficient navigation-relevant contextual understanding. To address this, we propose a unified multi-task visual encoder tailored for urban driving, jointly learning depth estimation, ego-motion pose, 3D scene flow, and semantic/instance/motion/panoptic segmentation to construct a dense, navigation-semantic-rich latent representation. We introduce a novel multi-scale pose feature network and a knowledge distillation framework leveraging multiple backbone teacher models for efficient, cooperative optimization. Our encoder achieves state-of-the-art or competitive performance across all seven perception tasks. When frozen and transferred to steering angle estimation, it significantly outperforms both fine-tuned baselines and ImageNet-pretrained models, demonstrating superior generalization capability and navigation-oriented representation learning.
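The core architectural idea in the summary above, one shared encoder whose latent feeds several task heads plus a downstream steering head reading the frozen latent, can be sketched in miniature. This is an illustrative toy (all sizes, names, and the linear "encoder" are hypothetical stand-ins), not the paper's actual network:

```python
import random

random.seed(0)
D_IN, D_LATENT = 8, 4  # toy dimensions; the real encoder is a deep vision backbone

def linear(n_out, n_in):
    # Random weight matrix standing in for a trained layer.
    return [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]

def apply(W, x):
    # Plain matrix-vector product.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# One shared encoder; seven task-specific heads read its latent.
encoder = linear(D_LATENT, D_IN)
heads = {task: linear(2, D_LATENT) for task in
         ["depth", "pose", "scene_flow", "semantic",
          "instance", "panoptic", "motion"]}
# Downstream steering head reads the same latent with the encoder frozen.
steering_head = linear(1, D_LATENT)

x = [random.random() for _ in range(D_IN)]  # stand-in for an input image
z = apply(encoder, x)                       # single shared forward pass
outputs = {task: apply(W, z) for task, W in heads.items()}
steering = apply(steering_head, z)
print(len(outputs), len(steering))          # 7 1
```

The point of the sketch is the sharing pattern: every task consumes the same latent `z`, so one encoder forward pass serves all seven perception tasks and the steering task, which is what makes the multi-task inference efficient.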

📝 Abstract
Autonomous driving systems require a comprehensive understanding of the environment, achieved by extracting visual features essential for perception, planning, and control. However, models trained solely on single-task objectives or generic datasets often lack the contextual information needed for robust performance in complex driving scenarios. In this work, we propose a unified encoder trained on multiple computer vision tasks crucial for urban driving, including depth, pose, and 3D scene flow estimation, as well as semantic, instance, panoptic, and motion segmentation. By integrating these diverse visual cues, similar to human perceptual mechanisms, the encoder captures rich features that enhance navigation-related predictions. We evaluate the model on steering estimation as a downstream task, leveraging its dense latent space. To ensure efficient multi-task learning, we introduce a multi-scale feature network for pose estimation and apply knowledge distillation from a multi-backbone teacher model. Our results highlight two key findings: (1) the unified encoder achieves competitive performance across all visual perception tasks, demonstrating strong generalization capabilities; and (2) for steering estimation, the frozen unified encoder, leveraging dense latent representations, outperforms both its fine-tuned counterpart and the same frozen model pretrained on generic datasets such as ImageNet. These results underline the significance of task-specific visual features and demonstrate the promise of multi-task learning in advancing autonomous driving systems. More details and the pretrained model are available at https://hi-computervision.github.io/uni-encoder/.
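The abstract mentions knowledge distillation from a multi-backbone teacher. The standard distillation objective, a temperature-softened KL divergence between teacher and student outputs, can be written compactly; the sketch below is a generic illustration of that loss (function names and the T^2 scaling convention are assumptions, not taken from the paper):

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T yields softer distributions.
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on temperature-softened outputs,
    # scaled by T^2 so gradient magnitudes stay comparable across T.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return (T ** 2) * kl

# Identical logits give zero loss; any mismatch gives a positive loss.
print(distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # 0.0
```

With multiple teacher backbones, one common design is to average the teachers' softened distributions (or sum per-teacher losses) before comparing against the student; which variant this paper uses is not stated in the abstract.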
Problem

Research questions and friction points this paper is trying to address.

Unified encoder for multi-task inference in autonomous driving
Enhancing navigation predictions through diverse visual cues
Efficient multi-task learning with multi-scale feature network
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified encoder for multi-task visual perception
Multi-scale feature network for pose estimation
Knowledge distillation from multi-backbone teacher model
Huy-Dung Nguyen
Hybrid Intelligence part of Capgemini Engineering
Anass Bairouk
Hybrid Intelligence part of Capgemini Engineering
Mirjana Maras
Hybrid Intelligence part of Capgemini Engineering
Wei Xiao
Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology
Tsun-Hsuan Wang
Massachusetts Institute of Technology
Patrick Chareyre
Hybrid Intelligence part of Capgemini Engineering
Ramin M. Hasani
Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology
Marc Blanchon
Hybrid Intelligence - Capgemini
Daniela Rus
Andrew (1956) and Erna Viterbi Professor of Computer Science, MIT