🤖 AI Summary
Current time-series forecasting evaluation is compromised by low-quality benchmarks, which suffer from pre-training data contamination, causal leakage, and cross-modal descriptive information leakage, all of which produce spurious performance gains. Method: We propose a “high-fidelity benchmark” paradigm, formally defining three foundational principles: data-source reliability, causal rigor, and modality-structure clarity. Based on real-time API collection, we construct Fidel-TS, a large-scale multimodal time-series benchmark, and introduce causal isolation mechanisms and strict cross-modal alignment strategies. Contribution/Results: Empirical analysis reveals substantial evaluation bias in existing benchmarks. In contrast, Fidel-TS effectively exposes models’ generalization failures under realistic conditions, establishing the first causally credible evaluation standard for multimodal time-series forecasting.
📝 Abstract
The evaluation of time-series forecasting models is hindered by a critical lack of high-quality benchmarks, leading to a potential illusion of progress. Existing datasets suffer from issues ranging from pre-training data contamination in the age of LLMs to the causal leakage and description leakage prevalent in early multimodal designs. To address this, we formalize the core principles of high-fidelity benchmarking, focusing on data sourcing integrity, strict causal soundness, and structural clarity. We introduce Fidel-TS, a new large-scale benchmark built from the ground up on these principles by sourcing data from live APIs. Our extensive experiments validate this approach by exposing the critical biases and design limitations of prior benchmarks. Furthermore, we conclusively demonstrate that the causal relevance of textual information is the key factor in unlocking genuine performance gains in multimodal forecasting.