Vision-LLMs for Spatiotemporal Traffic Forecasting

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses two key challenges in grid-based traffic forecasting: (1) modeling complex spatiotemporal dependencies, and (2) poor generalization under data scarcity. It proposes ST-Vision-LLM, which reframes traffic prediction as a vision-language fusion problem. The method has three parts: (1) historical traffic grids are treated as image sequences and processed by the visual encoder of a Vision-LLM, giving the model a global spatial view; (2) floating-point values are encoded as single tokens through a specialized vocabulary, combined with two-stage numerical alignment fine-tuning; (3) after Supervised Fine-Tuning (SFT), the model is further optimized with Group Relative Policy Optimization (GRPO), a memory-efficient reinforcement learning method. On real-world mobile traffic datasets, the approach outperforms existing methods by 15.6% in long-term prediction accuracy and exceeds the second-best baseline by 30.04% in cross-domain few-shot scenarios.
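
To make the "traffic grid as image" idea concrete, here is a minimal sketch of how a history of traffic matrices could be turned into grayscale frames for a visual encoder. The function names, the 8-bit scaling, and the choice of one shared maximum across all frames are assumptions made for illustration; the summary does not describe the paper's actual preprocessing.

```python
from typing import List

Grid = List[List[float]]    # one traffic matrix: rows x columns of cell loads
Image = List[List[int]]     # one grayscale frame with pixel values in 0..255


def grid_to_image(grid: Grid, vmax: float) -> Image:
    """Scale one traffic grid to 8-bit grayscale using a given maximum (assumed scheme)."""
    return [
        [min(255, max(0, round(255 * value / vmax))) for value in row]
        for row in grid
    ]


def grids_to_frames(history: List[Grid]) -> List[Image]:
    """Convert a sequence of grids into frames that share one intensity scale."""
    # A shared maximum keeps brightness comparable across time steps.
    # Fall back to 1.0 when every cell is zero to avoid division by zero.
    vmax = max(value for grid in history for row in grid for value in row) or 1.0
    return [grid_to_image(grid, vmax) for grid in history]


if __name__ == "__main__":
    history = [[[0.0, 5.0], [10.0, 2.5]], [[1.0, 0.0], [0.0, 4.0]]]
    for frame in grids_to_frames(history):
        print(frame)
```

Scaling every frame with the same maximum means a brighter pixel always indicates more traffic, regardless of which time step the frame comes from.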

📝 Abstract
Accurate spatiotemporal traffic forecasting is a critical prerequisite for proactive resource management in dense urban mobile networks. While Large Language Models (LLMs) have shown promise in time series analysis, they inherently struggle to model the complex spatial dependencies of grid-based traffic data. Effectively extending LLMs to this domain is challenging, as representing the vast amount of information from dense geographical grids can be inefficient and overwhelm the model's context. To address these challenges, we propose ST-Vision-LLM, a novel framework that reframes spatiotemporal forecasting as a vision-language fusion problem. Our approach leverages a Vision-LLM visual encoder to process historical global traffic matrices as image sequences, providing the model with a comprehensive global view to inform cell-level predictions. To overcome the inefficiency of LLMs in handling numerical data, we introduce an efficient encoding scheme that represents floating-point values as single tokens via a specialized vocabulary, coupled with a two-stage numerical alignment fine-tuning process. The model is first trained with Supervised Fine-Tuning (SFT) and then further optimized for predictive accuracy using Group Relative Policy Optimization (GRPO), a memory-efficient reinforcement learning method. Evaluations on real-world mobile traffic datasets demonstrate that ST-Vision-LLM outperforms existing methods by 15.6% in long-term prediction accuracy and exceeds the second-best baseline by 30.04% in cross-domain few-shot scenarios. Our extensive experiments validate the model's strong generalization capabilities across various data-scarce environments.
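
The single-token numeric encoding can be illustrated with a small sketch. The abstract does not specify how the specialized vocabulary is built, so everything below is an assumption: values are taken to be normalized to [0, 1], quantized to a fixed number of decimals, and mapped to one dedicated token ID each. The class name `FloatTokenVocab`, the `<num_…>` token text, and the `base_id` offset are hypothetical.

```python
class FloatTokenVocab:
    """Hypothetical vocabulary with one token per value on a decimal grid in [0, 1]."""

    def __init__(self, decimals: int = 2, base_id: int = 0) -> None:
        self.decimals = decimals
        self.scale = 10 ** decimals   # number of quantization steps
        self.base_id = base_id        # assumed ID where numeric tokens start

    def __len__(self) -> int:
        # Both endpoints 0.0 and 1.0 get a token, hence scale + 1 entries.
        return self.scale + 1

    def encode(self, value: float) -> int:
        """Map a normalized float to exactly one token ID (values are clipped)."""
        clipped = min(1.0, max(0.0, value))
        return self.base_id + round(clipped * self.scale)

    def decode(self, token_id: int) -> float:
        """Recover the quantized float from a numeric token ID."""
        index = token_id - self.base_id
        if not 0 <= index <= self.scale:
            raise ValueError("not a numeric token id")
        return index / self.scale

    def token_text(self, token_id: int) -> str:
        """Readable surface form of the token, e.g. '<num_0.37>'."""
        return f"<num_{self.decode(token_id):.{self.decimals}f}>"


if __name__ == "__main__":
    vocab = FloatTokenVocab(decimals=2, base_id=32000)
    token_id = vocab.encode(0.37)
    print(token_id, vocab.token_text(token_id), vocab.decode(token_id))
```

Under these assumptions, a value such as 0.37 costs one token instead of several digit or sub-word tokens, at the price of a quantization error of at most half a grid step.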
Problem

Research questions and friction points this paper is trying to address.

Extending LLMs to model spatial dependencies in traffic data
Overcoming inefficiency in representing dense geographical grid information
Improving cross-domain generalization for mobile traffic forecasting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-LLM visual encoder processes traffic matrices as images
Efficient encoding represents floating-point values as single tokens via a specialized vocabulary
Training applies SFT first, then GRPO reinforcement learning to optimize predictive accuracy
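
The core of GRPO is that each sampled output is scored relative to the other outputs drawn for the same prompt, which removes the need for a separate value network. The sketch below shows only that group-relative advantage step. The reward (negative absolute error) is an assumption chosen for illustration, not the paper's reward design, and the use of the population standard deviation is likewise an assumption, since implementations differ on this point.

```python
from statistics import fmean, pstdev
from typing import List


def prediction_reward(predicted: float, actual: float) -> float:
    """Assumed reward: negative absolute error, so closer forecasts score higher."""
    return -abs(predicted - actual)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; an assumption, see lead-in
    return [(reward - mean) / (std + eps) for reward in rewards]


if __name__ == "__main__":
    actual = 0.50
    candidates = [0.40, 0.50, 0.70]  # hypothetical forecasts sampled for one cell
    rewards = [prediction_reward(c, actual) for c in candidates]
    print(group_relative_advantages(rewards))
```

Forecasts that beat the group average receive positive advantages and are reinforced, while worse-than-average forecasts are pushed down. When all samples score the same, every advantage is zero and the group contributes no gradient signal.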
Ning Yang
Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China
Hengyu Zhong
Westa College, Southwest University, Chongqing, 400715, China. This work was performed while he was an intern at the Institute of Automation, Chinese Academy of Sciences
Haijun Zhang
Professor, IEEE Fellow, University of Science and Technology Beijing
6G, AI enabled Wireless Communications, Resource Allocation, Mobility Management
Randall Berry
Professor of Electrical and Computer Engineering, Northwestern University
Network Economics, Wireless Networks, Information Theory, Communications, Networking