Prompting Underestimates LLM Capability for Time Series Classification

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses a critical limitation of current prompt-based evaluation methods for large language models (LLMs) on time series classification: such evaluations substantially underestimate the models' true temporal understanding. By systematically comparing prompt-based outputs with linear probes applied to the same internal representations, the work reveals, for the first time, that prompting fails to elicit discriminative temporal information that is already encoded within LLMs, often from the early transformer layers onward. Experimental results demonstrate that linear probes raise average F1 scores from 0.15–0.26 to 0.61–0.67, frequently outperforming specialized time series models. These findings establish that LLMs possess strong inherent time series comprehension, and that linear probing provides a more accurate and reliable assessment of their temporal reasoning abilities than conventional prompting strategies.
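To make the comparison concrete, the following is a minimal, hypothetical sketch (not the authors' code) of the linear-probing setup the summary describes: a time series is serialized as text, mean-pooled hidden states are taken from a frozen LLM, and a logistic-regression probe is trained on top. The backbone (`gpt2`), the serialization format, and the pooling choice are all illustrative assumptions.

```python
# Hypothetical sketch of a linear probe over frozen LLM representations.
# Backbone, serialization, and pooling are illustrative assumptions.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

MODEL_NAME = "gpt2"  # placeholder; the paper's backbones may differ
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

def serialize(series):
    # Encode a 1-D time series as a comma-separated string of values.
    return ", ".join(f"{x:.3f}" for x in series)

@torch.no_grad()
def embed(series):
    # Mean-pool the last hidden layer over tokens -> one vector per series.
    inputs = tokenizer(serialize(series), return_tensors="pt")
    hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, d_model)
    return hidden.mean(dim=1).squeeze(0).numpy()

def probe_f1(train_series, y_train, test_series, y_test):
    # Fit a linear (logistic-regression) probe on the frozen representations
    # and report macro-F1, the metric used in the paper's comparison.
    Z_train = np.stack([embed(s) for s in train_series])
    Z_test = np.stack([embed(s) for s in test_series])
    clf = LogisticRegression(max_iter=2000).fit(Z_train, y_train)
    return f1_score(y_test, clf.predict(Z_test), average="macro")
```

In the paper's setting, probes of this kind lift average F1 to 0.61–0.67 on the same representations for which zero-shot prompting stays at 0.15–0.26.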

📝 Abstract
Prompt-based evaluations suggest that large language models (LLMs) perform poorly on time series classification, raising doubts about whether they encode meaningful temporal structure. We show that this conclusion reflects limitations of prompt-based generation rather than the model's representational capacity by directly comparing prompt outputs with linear probes over the same internal representations. While zero-shot prompting performs near chance, linear probes improve average F1 from 0.15-0.26 to 0.61-0.67, often matching or exceeding specialized time series models. Layer-wise analyses further show that class-discriminative time series information emerges in early transformer layers and is amplified by visual and multimodal inputs. Together, these results demonstrate a systematic mismatch between what LLMs internally represent and what prompt-based evaluation reveals, leading current evaluations to underestimate their time series understanding.
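As a rough illustration of the layer-wise analysis mentioned in the abstract, the sketch below fits one probe per transformer layer and reports a macro-F1 per layer, which is one way to see where class-discriminative information first emerges. The backbone, pooling, and probe are assumptions, not the paper's exact protocol; the inputs are time series already serialized as text.

```python
# Hypothetical sketch of a layer-wise probing sweep: one linear probe per
# transformer layer, to locate where discriminative information emerges.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

MODEL_NAME = "gpt2"  # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def layer_embeddings(text):
    # One mean-pooled vector per layer (token embeddings + each block output).
    out = model(**tokenizer(text, return_tensors="pt"), output_hidden_states=True)
    return [h.mean(dim=1).squeeze(0).numpy() for h in out.hidden_states]

def layerwise_f1(train_texts, y_train, test_texts, y_test):
    # train_texts / test_texts: serialized time series strings.
    # Transpose per-example lists into per-layer lists, then probe each layer.
    train_by_layer = list(zip(*[layer_embeddings(t) for t in train_texts]))
    test_by_layer = list(zip(*[layer_embeddings(t) for t in test_texts]))
    scores = []
    for Z_tr, Z_te in zip(train_by_layer, test_by_layer):
        clf = LogisticRegression(max_iter=2000).fit(np.stack(Z_tr), y_train)
        scores.append(f1_score(y_test, clf.predict(np.stack(Z_te)), average="macro"))
    return scores  # scores[k]: probe F1 using layer-k representations
```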
Problem

Research questions and friction points this paper is trying to address.

time series classification
large language models
prompt-based evaluation
representational capacity
temporal structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

linear probing
time series classification
large language models
prompt-based evaluation
representational capacity