🤖 AI Summary
In bandwidth-, computation-, and latency-constrained edge wireless networks with low signal-to-noise ratios (SNRs) and multiple users, deploying multimodal large models (MLMs) faces severe challenges in communication efficiency, semantic consistency, and inference robustness.
Method: This paper proposes a token-centric distributed MLM deployment paradigm in which task-relevant tokens serve as the communication medium. We design a token-communication-driven collaborative inference architecture, introduce contrastive split fine-tuning for cross-modal semantic alignment, and incorporate a lightweight token compression mechanism that drastically reduces transmission overhead without compromising accuracy. Furthermore, we jointly optimize the multimodal transceivers and the base model to enable device-edge co-training.
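The summary does not detail how tokens are compressed before transmission. A minimal sketch of one plausible scheme, assuming top-k selection by a per-token relevance score (e.g., attention mass) followed by a learned linear down-projection (all dimensions here are hypothetical):

```python
import numpy as np

def compress_tokens(tokens, scores, k, proj):
    """Keep the k highest-scoring tokens, then project them to a lower dimension.

    tokens: (n, d) token features; scores: (n,) per-token relevance;
    proj: (d, d_low) learned down-projection matrix.
    """
    keep = np.argsort(scores)[-k:]       # indices of the k most task-relevant tokens
    selected = tokens[np.sort(keep)]     # preserve the original token ordering
    return selected @ proj               # (k, d_low) payload sent over the channel

rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 256))      # 64 tokens, 256-dim features (hypothetical)
scores = rng.random(64)                  # per-token relevance scores
proj = rng.normal(size=(256, 32))        # hypothetical learned projection

payload = compress_tokens(tokens, scores, k=16, proj=proj)
print(payload.shape)  # (16, 32): 32x fewer values than the full 64x256 token grid
```

In a real system both the relevance scorer and the projection would be trained jointly with the transceivers, per the end-to-end fine-tuning described above.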
Results: Experiments under diverse SNR conditions demonstrate a 13.7% accuracy gain, accelerated convergence, and robust inference with shorter token lengths, validating the framework's scalability and channel resilience.
📝 Abstract
The proliferation of intelligent applications at the wireless edge, alongside the exponential growth of multimodal data, poses challenges for deploying multimodal large models (MLMs) in resource-constrained networks. These constraints manifest as limited bandwidth, limited computational capacity, and stringent latency requirements, particularly under low signal-to-noise ratio (SNR) conditions. To overcome these limitations, we propose a token communication paradigm that facilitates the decentralized deployment of MLMs across user devices and edge infrastructure (e.g., base stations). In this paradigm, task-relevant tokens are extracted from multimodal inputs and serve as the primary medium for communication between distributed model components. To align semantics and optimize transmission efficiency, we propose a dual-pronged approach: 1) We design a contrastive split fine-tuning method to project heterogeneous modalities into a shared feature space, enabling seamless interaction between model components while preserving modality-specific semantics. 2) We employ a lightweight compression technique to reduce the size of transmitted tokens, minimizing bandwidth consumption without sacrificing task-critical information. The proposed framework integrates collaborative fine-tuning of both the foundation model and the multimodal transceivers, ensuring that token generation and utilization are tailored to specific downstream tasks. Simulation experiments conducted under different SNR conditions demonstrate that our method yields a 13.7% improvement in test accuracy. Furthermore, our approach exhibits faster convergence, even with reduced token lengths, highlighting the promise of token communication for facilitating more scalable and resilient MLM implementations in practical multiuser networks.
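The abstract does not specify the contrastive objective used to align modalities in the shared feature space. A common choice for this kind of cross-modal alignment is a CLIP-style symmetric InfoNCE loss, sketched below under that assumption (batch size, dimension, and temperature are illustrative):

```python
import numpy as np

def contrastive_alignment_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired modality embeddings.

    a, b: (batch, dim) L2-normalized embeddings from two modalities;
    row i of `a` and row i of `b` form a positive pair, all other
    cross-pairs in the batch act as negatives.
    """
    logits = (a @ b.T) / temperature            # pairwise cosine similarities
    labels = np.arange(len(a))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()     # cross-entropy on the diagonal

    # average the A->B and B->A directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

When the two modality encoders agree (paired embeddings nearly identical), the loss approaches zero; mismatched embeddings drive it up, pushing the encoders toward the shared space described above.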