🤖 AI Summary
This work addresses few-shot multimodal time series classification, where data scarcity and distribution shift limit performance. It proposes MarsTSC, which the authors describe as the first agentic reasoning framework for this task built on vision-language models (VLMs). MarsTSC uses three collaborating roles: a Generator that classifies via reasoning, a Reflector that diagnoses the root causes of reasoning errors, and a Modifier that applies verified updates to a self-evolving knowledge bank serving as dynamic context. A cautious test-time update strategy keeps refining the knowledge bank to mitigate few-shot bias and distribution shift. Across 12 time series benchmarks and 6 VLM backbones, the framework delivers consistent gains over classical and foundation model-based baselines under few-shot conditions, and it produces human-readable rationales for each prediction.
📝 Abstract
In this paper, we propose the first VL$\underline{\textbf{M}}$ $\underline{\textbf{a}}$gentic $\underline{\textbf{r}}$easoning framework for few-$\underline{\textbf{s}}$hot multimodal $\underline{\textbf{T}}$ime $\underline{\textbf{S}}$eries $\underline{\textbf{C}}$lassification ($\textbf{MarsTSC}$), which introduces a self-evolving knowledge bank as a dynamic context that is iteratively refined via reflective agentic reasoning. The framework comprises three collaborative roles: i) Generator conducts reliable classification via reasoning; ii) Reflector diagnoses the root causes of reasoning errors to yield discriminative insights targeting the temporal features overlooked by Generator; iii) Modifier applies verified updates to the knowledge bank to prevent context collapse. We further introduce a test-time update strategy that cautiously and continuously refines the knowledge bank, mitigating few-shot bias and distribution shift. Extensive experiments across 12 mainstream time series benchmarks demonstrate that $\textbf{MarsTSC}$ delivers substantial and consistent performance gains across 6 VLM backbones, outperforming both classical and foundation model-based time series baselines under few-shot conditions. It also produces interpretable rationales that ground each classification decision in human-readable feature evidence.
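The interplay of the three roles can be illustrated with a toy sketch. This is not the authors' implementation: the VLM calls are replaced by hand-written rules, and every name here (`FEATURES`, `generate`, `reflect`, `modify`, `evolve`), the rule format, and the acceptance criterion (keep an update only if support-set accuracy strictly improves) are assumptions made for illustration. The test-time update strategy is not modeled.

```python
from typing import Callable, Dict, List, Optional, Tuple

Series = List[float]
# Hypothetical knowledge-bank entry: (feature name, feature is positive?, label).
Rule = Tuple[str, bool, str]
Example = Tuple[Series, str]

# Hypothetical temporal features standing in for what a VLM might describe.
FEATURES: Dict[str, Callable[[Series], float]] = {
    "slope": lambda s: s[-1] - s[0],
    "mean": lambda s: sum(s) / len(s),
}


def generate(series: Series, bank: List[Rule], default: str = "unknown") -> str:
    """Generator stand-in: classify with the first matching rule in the bank."""
    for name, positive, label in bank:
        if (FEATURES[name](series) > 0) == positive:
            return label
    return default


def reflect(series: Series, true_label: str, bank: List[Rule]) -> Optional[Rule]:
    """Reflector stand-in: propose a rule the bank lacks for a misclassified series."""
    for name, fn in FEATURES.items():
        rule = (name, fn(series) > 0, true_label)
        if rule not in bank:
            return rule
    return None


def accuracy(bank: List[Rule], data: List[Example]) -> float:
    """Fraction of examples the Generator stand-in labels correctly."""
    return sum(generate(s, bank) == y for s, y in data) / len(data)


def modify(bank: List[Rule], rule: Rule, support: List[Example]) -> List[Rule]:
    """Modifier stand-in: accept the update only if support accuracy strictly improves."""
    candidate = bank + [rule]
    if accuracy(candidate, support) > accuracy(bank, support):
        return candidate
    return bank


def evolve(support: List[Example], rounds: int = 3) -> List[Rule]:
    """Iteratively refine the knowledge bank over the few-shot support set."""
    bank: List[Rule] = []
    for _ in range(rounds):
        for series, label in support:
            if generate(series, bank) != label:
                rule = reflect(series, label, bank)
                if rule is not None:
                    bank = modify(bank, rule, support)
    return bank


# Toy few-shot support set: rising series are "up", falling series are "down".
support: List[Example] = [
    ([0.0, 1.0, 2.0], "up"),
    ([2.0, 1.0, 0.0], "down"),
    ([1.0, 2.0, 3.0], "up"),
    ([3.0, 2.0, 1.0], "down"),
]
bank = evolve(support)
print(bank, accuracy(bank, support))
```

In this sketch the verification step in `modify` is what keeps the bank from accumulating unhelpful entries: a proposed rule that does not improve accuracy on the support set is discarded, which loosely mirrors the abstract's point that only verified updates are applied.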