🤖 AI Summary
This work addresses the scarcity of high-quality multimodal data and systematic evaluation benchmarks for multimodal large language models (MLLMs) in the electromagnetic signal domain, as well as the significant performance degradation of existing methods under low signal-to-noise ratio (SNR) conditions. To this end, the authors construct EM-100k, the first large-scale paired electromagnetic signal–text dataset, and introduce EM-Bench, a comprehensive evaluation benchmark. They further propose MERLIN, a training framework that enhances model generalization in low-SNR environments through signal–semantic alignment and robustness augmentation mechanisms. This study establishes the first native MLLM paradigm tailored to the electromagnetic domain, encompassing data curation, benchmarking, and modeling. Experiments demonstrate that MERLIN achieves state-of-the-art performance on EM-Bench and exhibits exceptional robustness and stability under low-SNR conditions.
📝 Abstract
The paradigm of Multimodal Large Language Models (MLLMs) offers a promising blueprint for advancing the electromagnetic (EM) domain. However, prevailing approaches often deviate from the native MLLM paradigm, instead using task-specific or pipelined architectures that impose fundamental limitations on model performance and generalization. Fully realizing the potential of MLLMs in the EM domain requires overcoming three main challenges: (1) Data. The scarcity of high-quality datasets pairing EM signals with descriptive text annotations for MLLM pre-training; (2) Benchmark. The absence of comprehensive benchmarks to systematically evaluate and compare model performance on EM signal-to-text tasks; (3) Model. A critical fragility in low Signal-to-Noise Ratio (SNR) environments, where key signal features can be obscured, leading to significant performance degradation. To address these challenges, we introduce a tripartite contribution to establish a foundation for MLLMs in the EM domain. First, to overcome data scarcity, we construct and release EM-100k, a large-scale dataset comprising over 100,000 EM signal-text pairs. Second, to enable rigorous and standardized evaluation, we propose EM-Bench, the most comprehensive benchmark to date, featuring diverse downstream tasks spanning perception to reasoning. Finally, to tackle the core modeling challenge, we present MERLIN, a novel training framework designed not only to align low-level signal representations with high-level semantic text, but also to explicitly enhance model robustness and performance in challenging low-SNR environments. Comprehensive experiments validate our method, showing that MERLIN achieves state-of-the-art performance on EM-Bench and exhibits remarkable robustness in low-SNR settings.
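The abstract does not detail MERLIN's robustness augmentation, but a common ingredient in low-SNR training pipelines for EM signals is injecting additive white Gaussian noise at a controlled target SNR. The sketch below is purely illustrative of that general technique, not the paper's actual mechanism; the function name `add_awgn` and all parameters are our own assumptions.

```python
import numpy as np

def add_awgn(signal: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Add complex AWGN so the noisy signal has the target SNR in dB.

    Illustrative sketch of SNR-controlled noise augmentation; not the
    paper's actual robustness mechanism.
    """
    rng = np.random.default_rng() if rng is None else rng
    sig_power = np.mean(np.abs(signal) ** 2)
    # Noise power chosen so that 10*log10(sig_power / noise_power) == snr_db.
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    # Split power evenly across real and imaginary components.
    noise = np.sqrt(noise_power / 2.0) * (
        rng.standard_normal(signal.shape) + 1j * rng.standard_normal(signal.shape)
    )
    return signal + noise
```

During training, drawing `snr_db` from a range (e.g., a low-SNR regime) would expose the model to progressively degraded inputs, which is one plausible way to encourage the robustness the abstract describes.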