🤖 AI Summary
Single-frame visual inputs exhibit poor robustness in complex scenes, while state-of-the-art temporal models suffer from excessive computational overhead, hindering their deployment in federated learning (FL) settings.
Method: We propose a lightweight Temporal Transformer decomposition framework that factorizes the global attention matrix into low-rank components, drastically reducing parameter count and computational complexity. We further design an FL-aware distributed training strategy enabling efficient parameter aggregation and real-time inference. The model jointly processes multi-frame images and steering sequences to achieve accurate temporal modeling and cross-modal feature fusion under resource constraints.
Results: Our approach outperforms existing SOTA methods on three benchmark datasets, achieving significant accuracy gains and inference latency below 30 ms. Extensive real-world robotic experiments validate its practical deployability and strong generalization capability in heterogeneous edge environments.
📝 Abstract
Traditional vision-based autonomous driving systems often struggle to navigate complex environments when relying solely on single-image inputs. To overcome this limitation, incorporating temporal data, such as past image frames or steering sequences, has proven effective in enhancing robustness and adaptability in challenging scenarios. Although high-performance methods exist, they often rely on resource-intensive fusion networks, making them costly to train and unsuitable for federated learning. To address these challenges, we propose lightweight temporal transformer decomposition, a method that processes sequential image frames and temporal steering data by breaking down large attention maps into smaller matrices. This approach reduces model complexity, enabling efficient weight updates for convergence and real-time prediction while leveraging temporal information to enhance autonomous driving performance. Extensive experiments on three datasets demonstrate that our method outperforms recent approaches by a clear margin while achieving real-time performance. Real-robot experiments further confirm the effectiveness of our method.
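The core idea of decomposing a large attention map into smaller matrices can be illustrated with a minimal low-rank attention sketch. This is not the paper's actual implementation: the function name, shapes, and the random projection (here fixed rather than learned, in the spirit of Linformer-style key/value projection) are illustrative assumptions only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lowrank_attention(Q, K, V, r, seed=0):
    """Hypothetical sketch of low-rank attention factorization.

    Instead of the full n x n attention map softmax(Q K^T / sqrt(d)),
    keys and values are projected down to r rows, so the attention
    matrix is only n x r. Cost drops from O(n^2 d) to O(n r d).
    The projection P is random here; in practice it would be learned.
    """
    n, d = Q.shape
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((n, r)) / np.sqrt(r)  # n x r projection
    K_r = P.T @ K                                  # r x d compressed keys
    V_r = P.T @ V                                  # r x d compressed values
    A = softmax(Q @ K_r.T / np.sqrt(d))            # n x r attention map
    return A @ V_r                                 # n x d output

# Toy usage: 16 temporal tokens of dimension 32, compressed to rank 8.
rng = np.random.default_rng(1)
Q = rng.standard_normal((16, 32))
K = rng.standard_normal((16, 32))
V = rng.standard_normal((16, 32))
out = lowrank_attention(Q, K, V, r=8)  # shape (16, 32)
```

With sequence length n and rank r << n, the attention map shrinks from n x n to n x r, which is the kind of parameter and compute reduction the abstract describes for resource-constrained federated settings.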