VISTA: Vision-Language Inference for Training-Free Stock Time-Series Analysis

📅 2025-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Unimodal approaches to stock price forecasting struggle to capture the complementary patterns present in heterogeneous data sources. This paper proposes a training-free, zero-shot multimodal inference framework that jointly processes historical stock price time series (as textual sequences) and their corresponding line charts (as visual inputs), leveraging vision-language models (e.g., LLaVA, Qwen-VL) for end-to-end future price prediction. The key contributions are: (1) the first training-free multimodal paradigm for financial time-series analysis; and (2) a chain-of-thought prompting mechanism that extracts temporal patterns from the numerical and visual modalities jointly. Evaluated on standard benchmarks, the method achieves up to an 89.83% improvement over ARIMA and text-only large language models, demonstrating the effectiveness of training-free multimodal reasoning for financial forecasting.

📝 Abstract
Stock price prediction remains a complex and high-stakes task in financial analysis, traditionally addressed using statistical models or, more recently, language models. In this work, we introduce VISTA (Vision-Language Inference for Stock Time-series Analysis), a novel, training-free framework that leverages Vision-Language Models (VLMs) for multi-modal stock forecasting. VISTA prompts a VLM with both textual representations of historical stock prices and their corresponding line charts to predict future price values. By combining numerical and visual modalities in a zero-shot setting and using carefully designed chain-of-thought prompts, VISTA captures complementary patterns that unimodal approaches often miss. We benchmark VISTA against standard baselines, including ARIMA and text-only LLM-based prompting methods. Experimental results show that VISTA outperforms these baselines by up to 89.83%, demonstrating the effectiveness of multi-modal inference for stock time-series analysis and highlighting the potential of VLMs in financial forecasting tasks without requiring task-specific training.
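The paper does not publish its prompt templates, but the pipeline the abstract describes — serialize the price history as text, render the same series as a line chart, and query a VLM with a chain-of-thought prompt — can be sketched as follows. All function and variable names here are illustrative assumptions, not the authors' implementation; the chart would be rendered separately (e.g., with matplotlib) and attached as the image input to the VLM.

```python
def build_vista_prompt(prices, horizon=1):
    """Hypothetical sketch of a VISTA-style prompt builder.

    Formats the historical prices as a textual sequence and wraps them in a
    chain-of-thought forecasting instruction. The paired line-chart image of
    the same series is assumed to be passed to the VLM alongside this text.
    """
    # Serialize the numerical modality as a comma-separated sequence.
    series_text = ", ".join(f"{p:.2f}" for p in prices)
    return (
        "You are a financial time-series analyst.\n"
        f"Historical closing prices: [{series_text}]\n"
        "The attached image is a line chart of the same price series.\n"
        "Think step by step: (1) describe the trend visible in the chart, "
        "(2) note any patterns in the numerical sequence, "
        "(3) combine both views into a single forecast.\n"
        f"Finally, predict the next {horizon} closing price(s)."
    )

prompt = build_vista_prompt([101.20, 102.80, 101.90, 103.40])
print(prompt)
```

In a zero-shot setup like the one described, this text plus the chart image would be the entire model input: no fine-tuning, no task-specific training, only the frozen VLM's in-context reasoning.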
Problem

Research questions and friction points this paper is trying to address.

Training-free stock price prediction using vision-language models
Multi-modal forecasting combining text and visual data
Zero-shot approach outperforming traditional statistical methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free VLM framework for stock prediction
Combines numerical and visual data modalities
Zero-shot chain-of-thought prompting technique