🤖 AI Summary
This work addresses two limitations: financial time series forecasting models produce only numerical predictions without actionable decision guidance, and language models are hard to train for forward-looking advice because the quality of that advice depends on outcomes unknown at prediction time. The study proposes Hindsight Preference Optimization (HPO), which uses ex-post observed outcomes to let an LLM judge rank candidate advisories, automatically constructing preference pairs for direct preference optimization (DPO) without human annotation. The approach trains a vision-language model to deliver integrated guidance covering directional signals with reasoning, actionable recommendations, and risk management. Evaluated on S&P 500 equity time series, the resulting 4B-parameter model outperforms its 235B-parameter teacher model in both predictive accuracy and advisory quality.
📝 Abstract
Time series models predict numbers; decision-makers need advisory -- directional signals with reasoning, actionable suggestions, and risk management. Training language models for such predictive advisory faces a fundamental challenge: quality depends on outcomes unknown at prediction time. We bridge two ideas from reinforcement learning -- using information unavailable during execution to retrospectively generate training signal, and preference alignment -- and propose Hindsight Preference Optimization: observed outcomes let an LLM judge rank candidate advisories on dimensions that scalar metrics cannot capture, producing preference pairs for DPO without human annotation. We apply this to Vision-Language-Model-based predictive advisories on S&P 500 equity time series, where a 4B model outperforms its 235B teacher on both accuracy and advisory quality.
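The pipeline described in the abstract (score candidate advisories once the outcome is known, keep a chosen/rejected pair, train with DPO) can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in and not the paper's implementation: `Advisory`, `toy_hindsight_judge`, `build_preference_pair`, and the `min_margin` threshold are assumptions, and the toy judge uses a simple hand-written rule where the paper uses an LLM judge. Only the `dpo_loss` formula follows the standard published DPO objective.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Advisory:
    """Hypothetical container for one candidate advisory."""
    text: str       # reasoning, suggestion, and risk notes
    direction: str  # directional signal: "up" or "down"


def toy_hindsight_judge(advisory: Advisory, realized_return: float) -> float:
    """Toy stand-in for the LLM judge (assumption, not the paper's rubric).

    The judge sees the realized outcome, which was unknown at prediction time.
    """
    realized_direction = "up" if realized_return > 0 else "down"
    score = 1.0 if advisory.direction == realized_direction else 0.0
    # Illustrative bonus for mentioning a risk control.
    if "stop" in advisory.text.lower():
        score += 0.5
    return score


def build_preference_pair(
    candidates: List[Advisory],
    realized_return: float,
    judge: Callable[[Advisory, float], float],
    min_margin: float = 0.5,
) -> Optional[Tuple[Advisory, Advisory]]:
    """Return (chosen, rejected), or None if the candidates are too close."""
    scored = [(judge(c, realized_return), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    (best_score, best), (worst_score, worst) = scored[0], scored[-1]
    if best_score - worst_score < min_margin:
        return None
    return best, worst


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(margin))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

As a usage illustration, calling `build_preference_pair` with one advisory that matches the realized direction and one that does not yields the matching advisory as the chosen response; the resulting pair would then feed a DPO trainer, for which `dpo_loss` shows the per-pair objective on sequence log-probabilities under the policy and the frozen reference model.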