Solar-VLM: Multimodal Vision-Language Models for Augmented Solar Power Forecasting

📅 2026-04-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing photovoltaic (PV) power forecasting methods struggle to effectively integrate multi-source heterogeneous data—such as time series, satellite imagery, and textual weather descriptions—limiting their ability to model complex spatiotemporal dependencies. This work proposes a novel multimodal forecasting framework grounded in large language models, which, for the first time, leverages a vision-language foundation model (using Qwen as the visual backbone) in conjunction with a chunked temporal encoder and a text feature extractor. To capture spatial correlations among PV plants, the framework introduces a cross-site graph attention mechanism built upon a K-nearest neighbor graph and employs adaptive attention for unified multimodal fusion. Experiments on data from eight PV stations in northern China demonstrate that the proposed method significantly improves prediction accuracy, and the implementation code has been made publicly available.
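The chunked (patch-based) temporal encoder mentioned above splits each station's multivariate observation window into fixed-length patches before embedding. The paper does not publish the exact patching parameters here, so the following is only a minimal sketch of the general idea; `patchify`, the patch length, and the stride are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def patchify(series: np.ndarray, patch_len: int, stride: int) -> np.ndarray:
    """Split a multivariate series of shape (T, C) into overlapping patches.

    Returns an array of shape (num_patches, patch_len * C); each row is a
    flattened patch that a linear embedding layer could then project into
    the model dimension.
    """
    T, C = series.shape
    starts = range(0, T - patch_len + 1, stride)
    return np.stack([series[s:s + patch_len].reshape(-1) for s in starts])

# Toy example: 96 time steps (one day at 15-min resolution), 3 weather variables
x = np.random.rand(96, 3)
patches = patchify(x, patch_len=16, stride=8)
print(patches.shape)  # (11, 48)
```

Patching trades fine-grained per-step tokens for fewer, richer tokens, which is the usual motivation for patch-based time-series encoders.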
📝 Abstract
Photovoltaic (PV) power forecasting plays a critical role in power system dispatch and market participation. Because PV generation is highly sensitive to weather conditions and cloud motion, accurate forecasting requires effective modeling of complex spatiotemporal dependencies across multiple information sources. Although recent studies have advanced AI-based forecasting methods, most fail to fuse temporal observations, satellite imagery, and textual weather information in a unified framework. This paper proposes Solar-VLM, a large-language-model-driven framework for multimodal PV power forecasting. First, modality-specific encoders are developed to extract complementary features from heterogeneous inputs. The time-series encoder adopts a patch-based design to capture temporal patterns from multivariate observations at each site. The visual encoder, built upon a Qwen-based vision backbone, extracts cloud-cover information from satellite images. The text encoder distills historical weather characteristics from textual descriptions. Second, to capture spatial dependencies across geographically distributed PV stations, a cross-site feature fusion mechanism is introduced. Specifically, a Graph Learner models inter-station correlations through a graph attention network constructed over a K-nearest-neighbor (KNN) graph, while a cross-site attention module further facilitates adaptive information exchange among sites. Finally, experiments conducted on data from eight PV stations in a northern province of China demonstrate the effectiveness of the proposed framework. Our proposed model is publicly available at https://github.com/rhp413/Solar-VLM.
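The cross-site fusion described in the abstract builds a K-nearest-neighbor graph over the PV stations and mixes station features with graph attention. As a rough sketch only (the authors' Graph Learner is trainable; the dot-product scoring, feature dimensions, and station coordinates below are placeholder assumptions):

```python
import numpy as np

def knn_graph(coords: np.ndarray, k: int) -> np.ndarray:
    """Boolean adjacency matrix: each station links to its k nearest neighbors."""
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude self-loops from the ranking
    nbrs = np.argsort(d, axis=1)[:, :k]    # indices of the k closest stations
    adj = np.zeros(d.shape, dtype=bool)
    rows = np.repeat(np.arange(len(coords)), k)
    adj[rows, nbrs.ravel()] = True
    return adj

def graph_attention(h: np.ndarray, adj: np.ndarray) -> np.ndarray:
    """Single-head, GAT-style aggregation restricted to the KNN neighborhoods.

    Scores are dot products between station feature vectors; a softmax over
    each station's neighbors yields mixing weights for the aggregation.
    """
    scores = h @ h.T
    scores = np.where(adj, scores, -np.inf)          # mask non-neighbors
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return w @ h                                     # neighborhood-weighted features

# Eight stations (as in the experiments) with 2-D coordinates, 4-dim features
rng = np.random.default_rng(0)
coords = rng.random((8, 2))
feats = rng.random((8, 4))
adj = knn_graph(coords, k=3)
fused = graph_attention(feats, adj)
print(fused.shape)  # (8, 4)
```

A learned attention head (as in the actual Graph Learner) would replace the raw dot product with trainable projections, but the masking-and-softmax pattern over the KNN graph is the same.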
Problem

Research questions and friction points this paper is trying to address.

photovoltaic power forecasting
multimodal fusion
spatiotemporal dependencies
satellite imagery
textual weather information
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal fusion
vision-language model
graph attention network
solar power forecasting
cross-site feature fusion
Hang Fan
North China Electric Power University; Tsinghua University
Electricity Market, Time series prediction, Deep/Machine learning
Haoran Pei
School of Control and Computer Engineering, North China Electric Power University, 102206, Beijing, China
Runze Liang
Department of Electrical Engineering, Tsinghua University, 100084, Beijing, China
Weican Liu
School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, 639798, Singapore
Long Cheng
Professor, North China Electric Power University
Distributed computing, Reinforcement learning
Wei Wei
Associate Professor with tenure, Tsinghua University
Decision analytics, learning and optimization, power system operation, integrated energy system