AMBER: An Adaptive Multimodal Mask Transformer for Beam Prediction with Missing Modalities

📅 2025-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
In mmWave massive MIMO vehicular networks, beam prediction performance degrades sharply when multimodal sensing data (e.g., images, LiDAR, radar, GPS) are arbitrarily missing. To address this, we propose a robust beam prediction framework. Our key contributions are: (1) learnable modality tokens coupled with a missing-aware dynamic masking mechanism to explicitly model modality-absence patterns; (2) a class-former-aided modality alignment (CMA) module and temporal-aware positional encoding to achieve modality-agnostic, temporally consistent cross-modal representations; and (3) an end-to-end Transformer architecture integrating multi-head attention, learnable fusion tokens, and joint alignment. Evaluated on the real-world DeepSense6G dataset, our method significantly outperforms state-of-the-art approaches under severe multimodal missingness, achieving both high accuracy and strong robustness.

📝 Abstract
With the widespread adoption of millimeter-wave (mmWave) massive multiple-input multiple-output (MIMO) in vehicular networks, accurate beam prediction and alignment have become critical for high-speed data transmission and reliable access. While traditional beam prediction approaches primarily rely on in-band beam training, recent advances have started to explore multimodal sensing to extract environmental semantics for enhanced prediction. However, the performance of existing multimodal fusion methods degrades significantly in real-world settings because they are vulnerable to missing data caused by sensor blockage, poor lighting, or GPS dropouts. To address this challenge, we propose AMBER (Adaptive multimodal Mask transformer for BEam pRediction), a novel end-to-end framework that processes temporal sequences of image, LiDAR, radar, and GPS data, while adaptively handling arbitrary missing-modality cases. AMBER introduces learnable modality tokens and a missing-modality-aware mask to prevent cross-modal noise propagation, along with a learnable fusion token and multi-head attention to achieve robust modality-specific information distillation and feature-level fusion. Furthermore, a class-former-aided modality alignment (CMA) module and temporal-aware positional embedding are incorporated to preserve temporal coherence and ensure semantic alignment across modalities, facilitating the learning of modality-invariant and temporally consistent representations for beam prediction. Extensive experiments on the real-world DeepSense6G dataset demonstrate that AMBER significantly outperforms existing multimodal learning baselines. In particular, it maintains high beam prediction accuracy and robustness even under severe missing-modality scenarios, validating its effectiveness and practical applicability.
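The abstract names the core mechanism (learnable modality tokens, a missing-modality-aware attention mask, and a learnable fusion token with multi-head attention) without giving details. The sketch below shows one plausible reading of that pattern in PyTorch; it is not the paper's implementation, and all names (`MissingAwareFusion`, `present`, etc.) are hypothetical. Missing modalities are replaced by learnable placeholder tokens, a mask keeps modality slots from attending to absent ones, and a fusion token distills the surviving features.

```python
import torch
import torch.nn as nn

class MissingAwareFusion(nn.Module):
    """Hypothetical sketch of missing-modality-aware fusion:
    learnable placeholder tokens stand in for absent modalities,
    an attention mask blocks cross-modal noise from missing slots,
    and a learnable fusion token aggregates the result."""

    def __init__(self, num_modalities: int = 4, dim: int = 64, heads: int = 4):
        super().__init__()
        # One learnable token per modality (e.g., image/LiDAR/radar/GPS).
        self.modality_tokens = nn.Parameter(torch.randn(num_modalities, dim))
        # Learnable fusion token that distills modality-specific features.
        self.fusion_token = nn.Parameter(torch.randn(1, dim))
        self.heads = heads
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats: torch.Tensor, present: torch.Tensor) -> torch.Tensor:
        # feats: (B, M, dim) per-modality features; present: (B, M) bool.
        B, M, D = feats.shape
        placeholders = self.modality_tokens.expand(B, M, D)
        # Substitute the learnable token wherever a modality is missing.
        x = torch.where(present.unsqueeze(-1), feats, placeholders)
        # Prepend the fusion token (slot 0).
        x = torch.cat([self.fusion_token.expand(B, 1, D), x], dim=1)
        L = M + 1
        # Keys that every query may attend: fusion token + present modalities.
        key_ok = torch.cat(
            [torch.ones(B, 1, dtype=torch.bool), present], dim=1)  # (B, L)
        allow = key_ok.unsqueeze(1).expand(B, L, L).clone()
        # The fusion token additionally attends the learnable placeholders.
        allow[:, 0, :] = True
        # nn.MultiheadAttention: True in a bool attn_mask means "do not
        # attend"; a 3D mask must have shape (B * num_heads, L, L).
        mask = (~allow).repeat_interleave(self.heads, dim=0)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out[:, 0]  # fused representation for the beam-prediction head
```

With four modalities, a batch entry missing its LiDAR and radar features would simply carry `present = [True, False, False, True]`; the two absent slots are filled by their learnable tokens and excluded as attention keys for the other modalities.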
Problem

Research questions and friction points this paper is trying to address.

Addresses beam prediction degradation from missing multimodal sensor data.
Proposes adaptive transformer for robust fusion with missing modalities.
Enhances accuracy in real-world vehicular mmWave MIMO networks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive multimodal mask transformer handles missing modalities
Learnable tokens and masks prevent cross-modal noise propagation
Class-former-aided alignment ensures temporal and semantic coherence
Authors
Chenyiming Wen, Binpu Shi, Min Li, Ming-Min Zhao, Min-Jian Zhao
College of Information Science and Electronic Engineering and the Zhejiang Provincial Key Laboratory of Multi-Modal Communication Networks and Intelligent Information Processing, Zhejiang University, Hangzhou 310027, China
Jiangzhou Wang
Professor, University of Kent (Mobile Communications)