Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning

📅 2026-05-10
📈 Citations: 0 (influential: 0)

🤖 AI Summary
This work addresses the performance bottlenecks that data scarcity and distribution shift cause in few-shot multimodal time series classification. It proposes MarsTSC, which the authors describe as the first agentic reasoning framework built on vision-language models (VLMs) for this task. MarsTSC uses three collaborating roles: a Generator that classifies via reasoning, a Reflector that diagnoses the root causes of reasoning errors, and a Modifier that applies verified updates to a self-evolving knowledge bank. A cautious test-time update strategy continues to refine the knowledge bank to mitigate few-shot bias and distribution shift. Across 12 time series benchmarks and 6 VLM backbones, the framework is reported to deliver consistent gains over both classical and foundation model-based baselines under few-shot conditions. It also produces human-interpretable rationales for its predictions, which supports transparency.
📝 Abstract
In this paper, we propose the first VLM agentic reasoning framework for few-shot multimodal Time Series Classification (MarsTSC), which introduces a self-evolving knowledge bank as a dynamic context iteratively refined via reflective agentic reasoning. The framework comprises three collaborative roles: i) Generator conducts reliable classification via reasoning; ii) Reflector diagnoses the root causes of reasoning errors to yield discriminative insights targeting the temporal features overlooked by Generator; iii) Modifier applies verified updates to the knowledge bank to prevent context collapse. We further introduce a test-time update strategy to enable cautious, continuous knowledge bank refinement to mitigate few-shot bias and distribution shift. Extensive experiments across 12 mainstream time series benchmarks demonstrate that MarsTSC delivers substantial and consistent performance gains across 6 VLM backbones, outperforming both classical and foundation model-based time series baselines under few-shot conditions, while producing interpretable rationales that ground each classification decision in human-readable feature evidence.
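The Generator / Reflector / Modifier loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`KnowledgeBank`, `refine`, `modifier_apply`, the toy stand-ins for VLM calls, and the "insight must not lower support-set accuracy" acceptance rule) is a hypothetical choice made for illustration, since the abstract does not specify prompts, data formats, or the verification criterion. The test-time update strategy is left out because the abstract does not state how updates are accepted without labels.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical types: a sample is (textual stand-in for a series/plot, label).
Sample = Tuple[str, str]
Generator = Callable[[str, str], str]            # (sample, context) -> predicted label
Reflector = Callable[[str, str, str, str], str]  # (sample, truth, pred, context) -> insight


@dataclass
class KnowledgeBank:
    """Dynamic context: a list of textual insights shown to the Generator."""
    insights: List[str] = field(default_factory=list)

    def as_context(self) -> str:
        return "\n".join(f"- {s}" for s in self.insights)


def accuracy(generator: Generator, bank: KnowledgeBank, samples: List[Sample]) -> float:
    hits = sum(generator(x, bank.as_context()) == y for x, y in samples)
    return hits / len(samples)


def modifier_apply(bank: KnowledgeBank, insight: str,
                   generator: Generator, support: List[Sample]) -> bool:
    """Modifier (assumed rule): keep an insight only if it is new and does not
    reduce accuracy on the few-shot support set."""
    if insight in bank.insights:
        return False
    candidate = KnowledgeBank(bank.insights + [insight])
    if accuracy(generator, candidate, support) >= accuracy(generator, bank, support):
        bank.insights.append(insight)
        return True
    return False


def refine(bank: KnowledgeBank, generator: Generator, reflector: Reflector,
           support: List[Sample], rounds: int = 3) -> KnowledgeBank:
    """Iterate: Generator predicts, Reflector diagnoses errors, Modifier verifies."""
    for _ in range(rounds):
        changed = False
        for x, y in support:
            pred = generator(x, bank.as_context())
            if pred != y:
                insight = reflector(x, y, pred, bank.as_context())
                changed |= modifier_apply(bank, insight, generator, support)
        if not changed:  # no accepted update in a full pass: stop early
            break
    return bank


def toy_generator(x: str, context: str) -> str:
    # Stand-in for a VLM call: applies the first "cue => label" insight whose cue is in x.
    for line in context.splitlines():
        cue, _, label = line[2:].partition(" => ")
        if cue and cue in x:
            return label
    return "normal"


def toy_reflector(x: str, y: str, pred: str, context: str) -> str:
    # Stand-in for a VLM diagnosis: names the first word of x as the overlooked feature.
    return f"{x.split()[0]} => {y}"


SUPPORT: List[Sample] = [
    ("spike at t=40", "fault"),
    ("flat baseline", "normal"),
    ("drift upward slowly", "degrading"),
]

if __name__ == "__main__":
    bank = refine(KnowledgeBank(), toy_generator, toy_reflector, SUPPORT)
    print(bank.as_context())
```

In a real system the two toy functions would be replaced by prompted VLM calls over rendered time series plots and accompanying text. The verification step in `modifier_apply` is one plausible reading of "verified updates"; it is what keeps an unhelpful insight from accumulating in the context.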
Problem

Research questions and friction points this paper is trying to address.

Few-Shot Learning
Multimodal Time Series Classification
Vision-Language Models
Distribution Shift
Temporal Feature Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Reasoning
Few-Shot Time Series Classification
Self-Evolving Knowledge Bank
Multimodal Time Series
Vision-Language Models