🤖 AI Summary
Large language models (LLMs) consistently fail to beat simple baselines on time-series forecasting, classification, imputation, and anomaly detection: their architecture is not native to time series, can distort temporal structure, and exhibits pseudo-alignment, i.e., apparent effectiveness that arises from the data's intrinsic manifold rather than from genuine modality alignment. Method: multi-task benchmark evaluation, embedding-space manifold analysis, modality-alignment diagnostics, and randomized-initialization ablations across diverse LLM variants. Results: no evaluated LLM surpasses lightweight linear baselines, and several configurations significantly degrade performance. The manifold analysis further provides geometric evidence of structural misalignment between current LLM-based time-series methods and the underlying temporal dynamics. Core contribution: time-series efficacy does not stem from linguistic capability but from faithful modeling of the data's intrinsic dynamical structure, yielding a concrete design criterion and a cautionary lesson for building time-series foundation models.
📝 Abstract
Large Language Models (LLMs) have demonstrated impressive performance in many pivotal web applications, such as sensor data analysis. However, since LLMs are not designed for time series tasks, simple models like linear regression can often achieve comparable performance with far less complexity. In this study, we perform extensive experiments to assess the effectiveness of applying LLMs to key time series tasks, including forecasting, classification, imputation, and anomaly detection. We compare LLMs against simpler baselines, such as single-layer linear models and randomly initialized LLMs. Our results reveal that LLMs offer minimal advantages on these core time series tasks and may even distort the temporal structure of the data. In contrast, the simpler models consistently outperform LLMs while requiring far fewer parameters. Furthermore, we analyze existing reprogramming techniques and show, through data manifold analysis, that these methods fail to effectively align time series data with language and instead display pseudo-alignment behaviour in embedding space. Our findings suggest that the performance of LLM-based methods on time series tasks arises from the intrinsic characteristics and structure of time series data, rather than from any meaningful alignment with the language model architecture.
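To make the "single-layer linear model" baseline concrete, here is a minimal sketch in the spirit of such baselines: one linear map from a lookback window to a forecast horizon, fit by least squares. The synthetic sine series and the window/horizon values (`L=96`, `H=24`) are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Synthetic series: a noisy sine wave (stand-in for real benchmark data).
rng = np.random.default_rng(0)
t = np.arange(2000)
series = np.sin(2 * np.pi * t / 50) + 0.1 * rng.standard_normal(t.size)

L, H = 96, 24  # lookback window and forecast horizon (illustrative values)

# Build (lookback, horizon) training pairs with a sliding window.
n = len(series) - L - H
X = np.stack([series[i : i + L] for i in range(n)])
Y = np.stack([series[i + L : i + L + H] for i in range(n)])

split = int(0.8 * n)
# The entire "model" is one weight matrix W of shape (L, H).
W, *_ = np.linalg.lstsq(X[:split], Y[:split], rcond=None)

pred = X[split:] @ W
mse = np.mean((pred - Y[split:]) ** 2)
print(f"test MSE: {mse:.4f}")
```

A model this small has only `L * H` parameters, which is the point of the comparison: if it matches an LLM-based forecaster, the LLM's capacity is not buying anything on the task.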
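The pseudo-alignment claim can be illustrated with a toy diagnostic: score how close each time-series embedding lies to its nearest word embedding by cosine similarity. This is a simplified proxy for the paper's manifold analysis, with purely synthetic embeddings; the function name and the "aligned vs. unaligned" construction are hypothetical.

```python
import numpy as np

def alignment_score(ts_emb, word_emb):
    """Mean, over time-series embeddings, of the max cosine similarity
    to any word embedding -- a crude proxy for modality alignment."""
    a = ts_emb / np.linalg.norm(ts_emb, axis=1, keepdims=True)
    b = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
    return float((a @ b.T).max(axis=1).mean())

rng = np.random.default_rng(1)
d = 64
word_emb = rng.standard_normal((500, d))  # stand-in vocabulary embeddings

# "Aligned": points sampled near word vectors; "unaligned": independent draws.
aligned = word_emb[rng.integers(0, 500, 200)] + 0.1 * rng.standard_normal((200, d))
unaligned = rng.standard_normal((200, d))

print(alignment_score(aligned, word_emb), alignment_score(unaligned, word_emb))
```

In high dimensions, independent random embeddings score near zero while genuinely aligned ones score near one, so a low score for reprogrammed time-series embeddings would indicate they occupy their own region of the space rather than the language manifold.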