Solar-VLM: Multimodal Vision-Language Models for Augmented Solar Power Forecasting

📅 2026-04-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing photovoltaic (PV) power forecasting methods struggle to effectively integrate multi-source heterogeneous data—such as time series, satellite imagery, and textual weather descriptions—limiting their ability to model complex spatiotemporal dependencies. This work proposes a novel multimodal forecasting framework grounded in large language models, which, for the first time, leverages a vision-language foundation model (using Qwen as the visual backbone) in conjunction with a chunked temporal encoder and a text feature extractor. To capture spatial correlations among PV plants, the framework introduces a cross-site graph attention mechanism built upon a K-nearest neighbor graph and employs adaptive attention for unified multimodal fusion. Experiments on data from eight PV stations in northern China demonstrate that the proposed method significantly improves prediction accuracy, and the implementation code has been made publicly available.
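The chunked (patch-based) temporal encoder mentioned above splits each station's multivariate observation window into fixed-length patches before embedding. The paper does not publish the exact patching parameters here, so the following is only a minimal sketch of the general idea; `patchify`, the patch length, and the stride are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def patchify(series: np.ndarray, patch_len: int, stride: int) -> np.ndarray:
    """Split a multivariate series of shape (T, C) into overlapping patches.

    Returns an array of shape (num_patches, patch_len * C); each row is a
    flattened patch that a linear embedding layer could then project into
    the model dimension.
    """
    T, C = series.shape
    starts = range(0, T - patch_len + 1, stride)
    return np.stack([series[s:s + patch_len].reshape(-1) for s in starts])

# Toy example: 96 time steps (one day at 15-min resolution), 3 weather variables
x = np.random.rand(96, 3)
patches = patchify(x, patch_len=16, stride=8)
print(patches.shape)  # (11, 48)
```

Patching trades fine-grained per-step tokens for fewer, richer tokens, which is the usual motivation for patch-based time-series encoders.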
📝 Abstract
Photovoltaic (PV) power forecasting plays a critical role in power system dispatch and market participation. Because PV generation is highly sensitive to weather conditions and cloud motion, accurate forecasting requires effective modeling of complex spatiotemporal dependencies across multiple information sources. Although recent studies have advanced AI-based forecasting methods, most fail to fuse temporal observations, satellite imagery, and textual weather information in a unified framework. This paper proposes Solar-VLM, a large-language-model-driven framework for multimodal PV power forecasting. First, modality-specific encoders are developed to extract complementary features from heterogeneous inputs. The time-series encoder adopts a patch-based design to capture temporal patterns from multivariate observations at each site. The visual encoder, built upon a Qwen-based vision backbone, extracts cloud-cover information from satellite images. The text encoder distills historical weather characteristics from textual descriptions. Second, to capture spatial dependencies across geographically distributed PV stations, a cross-site feature fusion mechanism is introduced. Specifically, a Graph Learner models inter-station correlations through a graph attention network constructed over a K-nearest-neighbor (KNN) graph, while a cross-site attention module further facilitates adaptive information exchange among sites. Finally, experiments conducted on data from eight PV stations in a northern province of China demonstrate the effectiveness of the proposed framework. Our proposed model is publicly available at https://github.com/rhp413/Solar-VLM.
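The cross-site fusion described in the abstract builds a K-nearest-neighbor graph over the PV stations and mixes station features with graph attention. As a rough sketch only (the authors' Graph Learner is trainable; the dot-product scoring, feature dimensions, and station coordinates below are placeholder assumptions):

```python
import numpy as np

def knn_graph(coords: np.ndarray, k: int) -> np.ndarray:
    """Boolean adjacency matrix: each station links to its k nearest neighbors."""
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude self-loops from the ranking
    nbrs = np.argsort(d, axis=1)[:, :k]    # indices of the k closest stations
    adj = np.zeros(d.shape, dtype=bool)
    rows = np.repeat(np.arange(len(coords)), k)
    adj[rows, nbrs.ravel()] = True
    return adj

def graph_attention(h: np.ndarray, adj: np.ndarray) -> np.ndarray:
    """Single-head, GAT-style aggregation restricted to the KNN neighborhoods.

    Scores are dot products between station feature vectors; a softmax over
    each station's neighbors yields mixing weights for the aggregation.
    """
    scores = h @ h.T
    scores = np.where(adj, scores, -np.inf)          # mask non-neighbors
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return w @ h                                     # neighborhood-weighted features

# Eight stations (as in the experiments) with 2-D coordinates, 4-dim features
rng = np.random.default_rng(0)
coords = rng.random((8, 2))
feats = rng.random((8, 4))
adj = knn_graph(coords, k=3)
fused = graph_attention(feats, adj)
print(fused.shape)  # (8, 4)
```

A learned attention head (as in the actual Graph Learner) would replace the raw dot product with trainable projections, but the masking-and-softmax pattern over the KNN graph is the same.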
Problem

Research questions and friction points this paper is trying to address.

photovoltaic power forecasting
multimodal fusion
spatiotemporal dependencies
satellite imagery
textual weather information
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal fusion
vision-language model
graph attention network
solar power forecasting
cross-site feature fusion
Hang Fan
North China Electric Power University; Tsinghua University
Electricity Market, Time series prediction, Deep/Machine learning
Haoran Pei
School of Control and Computer Engineering, North China Electric Power University, 102206, Beijing, China
Runze Liang
Department of Electrical Engineering, Tsinghua University, 100084, Beijing, China
Weican Liu
School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, 639798, Singapore
Long Cheng
Professor, North China Electric Power University
Distributed computing, Reinforcement learning
Wei Wei
Associate Professor with tenure, Tsinghua University
Decision analytics, learning and optimization, power system operation, integrated energy system