MultiPark: Multimodal Parking Transformer with Next-Segment Prediction

📅 2025-08-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing imitation learning methods for autonomous parking in narrow, lane-free environments model multimodal behavior inadequately and suffer from causal confusion, which leads to poor generalization. To address these issues, the paper proposes MultiPark, an autoregressive Transformer for multimodal parking, built on three ideas: (1) a next-segment prediction paradigm that improves spatial generalization and temporal extrapolation; (2) learnable parking queries factorized into gear, longitudinal, and lateral components, which decode diverse parking behaviors in parallel; and (3) outcome-oriented losses on target-centric pose and ego-centric collision, applied across all modalities alongside the imitation loss. Evaluated on real-world datasets, the method achieves state-of-the-art performance across various scenarios, and it has been deployed on a production vehicle.

📝 Abstract
Parking accurately and safely in highly constrained spaces remains a critical challenge. Unlike structured driving environments, parking requires executing complex maneuvers such as frequent gear shifts and steering saturation. Recent attempts to employ imitation learning (IL) for parking have achieved promising results. However, existing works ignore the multimodal nature of parking behavior in lane-free open space and fail to derive multiple plausible solutions for the same situation. Moreover, IL-based methods suffer from inherent causal confusion, which makes it particularly difficult for a neural network to generalize across diverse parking scenarios. To address these challenges, we propose MultiPark, an autoregressive transformer for multimodal parking. To handle paths filled with abrupt turning points, we introduce a data-efficient next-segment prediction paradigm, enabling spatial generalization and temporal extrapolation. Furthermore, we design learnable parking queries factorized into gear, longitudinal, and lateral components, which decode diverse parking behaviors in parallel. To mitigate causal confusion in IL, our method applies target-centric pose and ego-centric collision losses as outcome-oriented objectives across all modalities, in addition to the pure imitation loss. Evaluations on real-world datasets demonstrate that MultiPark achieves state-of-the-art performance across various scenarios. We deploy MultiPark on a production vehicle, further confirming our approach's robustness in real-world parking environments.
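To make the idea of a target-centric pose objective more concrete, the sketch below expresses a predicted final pose in the coordinate frame of the target slot and sums the absolute position and heading errors. It is only an illustration under assumptions: the abstract does not give the loss formula, so the function names, the (x, y, yaw) pose format, the L1 form, and the `yaw_weight` parameter are all hypothetical.

```python
import math


def to_target_frame(pose, target):
    """Express a world-frame pose (x, y, yaw) in the target slot's frame.

    Hypothetical helper: poses are assumed to be (x, y, yaw) with yaw in radians.
    """
    px, py, pyaw = pose
    tx, ty, tyaw = target
    dx, dy = px - tx, py - ty
    cos_t, sin_t = math.cos(tyaw), math.sin(tyaw)
    # Rotate the offset by -tyaw so the target's heading becomes the x-axis.
    x_rel = cos_t * dx + sin_t * dy
    y_rel = -sin_t * dx + cos_t * dy
    # Wrap the heading difference into [-pi, pi).
    yaw_rel = (pyaw - tyaw + math.pi) % (2 * math.pi) - math.pi
    return x_rel, y_rel, yaw_rel


def target_centric_pose_error(final_pose, target, yaw_weight=1.0):
    """Assumed L1-style error between a final pose and the target slot pose."""
    x_rel, y_rel, yaw_rel = to_target_frame(final_pose, target)
    return abs(x_rel) + abs(y_rel) + yaw_weight * abs(yaw_rel)


# Example: a slot at (2, 1) facing +y, and a vehicle stopped 2 m ahead of it
# along the slot's heading with the same orientation.
slot = (2.0, 1.0, math.pi / 2)
stopped = (2.0, 3.0, math.pi / 2)
error = target_centric_pose_error(stopped, slot)
```

Because the error is measured relative to the slot rather than in world or ego coordinates, the same value results wherever the slot sits on the map, which is one plausible reading of why an outcome-oriented term of this kind complements pure imitation.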
Problem

Research questions and friction points this paper is trying to address.

Addressing multimodal parking behavior in open spaces
Overcoming causal confusion in imitation learning for parking
Enhancing spatial generalization and temporal extrapolation in parking
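As a rough illustration of what a next-segment formulation could look like on the data side, the sketch below cuts a recorded parking path into segments and pairs each segment with the segments that precede it. The segmentation rule is an assumption: the abstract only mentions frequent gear shifts and abrupt turning points, so splitting at gear changes, the `(x, y, gear)` waypoint format, and both function names are hypothetical.

```python
from itertools import groupby


def split_into_segments(waypoints):
    """Split a path of (x, y, gear) tuples into runs of constant gear.

    Assumption: a segment boundary is placed wherever the gear changes.
    """
    return [list(run) for _, run in groupby(waypoints, key=lambda w: w[2])]


def next_segment_pairs(segments):
    """Build (completed_segments, next_segment) training pairs.

    Hypothetical pairing: the model would see the segments driven so far
    and be asked to predict the one that follows.
    """
    return [(segments[:i], segments[i]) for i in range(len(segments))]


# Example path: drive forward, back up while turning, then pull forward again.
path = [
    (0.0, 0.0, "D"), (1.0, 0.0, "D"), (2.0, 0.0, "D"),
    (1.5, 0.2, "R"), (1.0, 0.5, "R"),
    (1.5, 0.6, "D"),
]
segments = split_into_segments(path)
pairs = next_segment_pairs(segments)
```

In this toy path the three segments have lengths 3, 2, and 1, so a single demonstration yields three prediction targets instead of one full trajectory.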
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive transformer for multimodal parking
Next-segment prediction for spatial generalization
Learnable parking queries factorized into components
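One way to picture factorized queries is as a small set of per-factor vectors that are combined into many composite behavior modes. The sketch below is a toy version under assumptions: the summary does not say how the gear, longitudinal, and lateral components are combined, so the element-wise sum, the component names, and the two-dimensional vectors are hypothetical, and plain lists stand in for learned embeddings.

```python
from itertools import product


def compose_queries(gear_q, lon_q, lat_q):
    """Combine one component per factor into composite mode queries.

    Assumption: components are merged by element-wise sum.
    """
    modes = {}
    for (g, gv), (lo, lov), (la, lav) in product(
        gear_q.items(), lon_q.items(), lat_q.items()
    ):
        modes[(g, lo, la)] = [a + b + c for a, b, c in zip(gv, lov, lav)]
    return modes


# Hypothetical component tables; in a real model these would be learned.
gear_q = {"forward": [1.0, 0.0], "reverse": [-1.0, 0.0]}
lon_q = {"short": [0.0, 0.5], "long": [0.0, 1.0]}
lat_q = {"left": [0.25, 0.0], "straight": [0.0, 0.0], "right": [-0.25, 0.0]}

modes = compose_queries(gear_q, lon_q, lat_q)
num_components = len(gear_q) + len(lon_q) + len(lat_q)
```

Here seven component vectors produce twelve composite modes, which illustrates why a factorized design can cover many gear, longitudinal, and lateral combinations without a separate query for each.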
Han Zheng
Shanghai Jiao Tong University, Shanghai, China
Zikang Zhou
Zhuoyu Technology, Co., Ltd., Shenzhen, China
Guli Zhang
Zhuoyu Technology, Co., Ltd., Shenzhen, China
Zhepei Wang
Zhuoyu Technology, Co., Ltd., Shenzhen, China
Kaixuan Wang
Zhuoyu Technology, Co., Ltd., Shenzhen, China
Peiliang Li
Zhuoyu Technology, Co., Ltd., Shenzhen, China
Shaojie Shen
Associate Professor, Hong Kong University of Science and Technology
Robotics
Ming Yang
Shanghai Jiao Tong University, Shanghai, China
Tong Qin
Shanghai Jiao Tong University
Robotics, SLAM, Computer Vision