Small Vocabularies, Big Gains: Pretraining and Tokenization in Time Series Models

📅 2025-11-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how tokenizer design—specifically scaling and quantization strategies—and pretraining mechanisms affect the representational capacity of time-series foundation models. Method: Through systematic empirical evaluation and theoretical analysis, we decouple the roles of tokenization configuration and pretraining, and propose a lightweight, continuous-signal-aware tokenization paradigm suited to multimodal shared vocabularies. Contribution/Results: We find that tokenizer configuration predominantly governs model expressivity and stability, whereas pretraining mainly improves optimization efficiency; their synergy lets compact vocabularies (≤128) match the performance of large ones (e.g., ≥1024). The proposed paradigm significantly enhances cross-modal transferability. Empirical results demonstrate that principled tokenization amplifies pretraining gains, yielding reproducible design principles and theoretical foundations for discrete representation learning in time series.

📝 Abstract
Tokenization and transfer learning are two critical components in building state-of-the-art time series foundation models for forecasting. In this work, we systematically study the effect of tokenizer design, specifically scaling and quantization strategies, on model performance, alongside the impact of pretraining versus random initialization. We show that tokenizer configuration primarily governs the representational capacity and stability of the model, while transfer learning influences optimization efficiency and alignment. Using a combination of empirical training experiments and theoretical analyses, we demonstrate that pretrained models consistently leverage well-designed tokenizers more effectively, particularly at smaller vocabulary sizes. Conversely, misaligned tokenization can diminish or even invert the benefits of pretraining. These findings highlight the importance of careful tokenization in time series modeling and suggest that combining small, efficient vocabularies with pretrained weights is especially advantageous in multimodal forecasting settings, where the overall vocabulary must be shared across modalities. Our results provide concrete guidance for designing tokenizers and leveraging transfer learning in discrete representation learning for continuous signals.
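The scaling-plus-quantization tokenization the abstract describes can be illustrated with a minimal sketch. This is an assumed, hypothetical implementation (mean scaling followed by uniform binning over a clipped range), not the paper's exact method; `vocab_size`, `clip`, and the function names are illustrative choices:

```python
import numpy as np

def tokenize_series(x, vocab_size=128, clip=3.0):
    """Map a continuous series to discrete tokens via scaling + uniform quantization.

    Illustrative sketch: mean scaling and uniform bins are assumptions,
    not necessarily the strategies studied in the paper.
    """
    x = np.asarray(x, dtype=float)
    scale = np.mean(np.abs(x))          # mean scaling
    scale = scale if scale > 0 else 1.0  # guard against an all-zero series
    z = np.clip(x / scale, -clip, clip)
    # vocab_size uniform bins over [-clip, clip] -> token ids 0..vocab_size-1
    edges = np.linspace(-clip, clip, vocab_size + 1)
    return np.clip(np.digitize(z, edges) - 1, 0, vocab_size - 1), scale

def detokenize(tokens, scale, vocab_size=128, clip=3.0):
    """Approximately invert tokenization using bin centers."""
    edges = np.linspace(-clip, clip, vocab_size + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers[tokens] * scale
```

With `vocab_size=128` the per-value quantization error is bounded by half a bin width times the scale, which is why, per the paper's findings, a compact vocabulary can suffice when paired with a well-chosen scaling scheme.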
Problem

Research questions and friction points this paper is trying to address.

How tokenizer design (scaling and quantization strategies) affects time series model performance
How pretraining versus random initialization affects optimization efficiency and alignment
Whether small vocabularies can be efficient in multimodal forecasting, where the vocabulary is shared across modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic study of scaling and quantization strategies in tokenizer design
Evidence that pretrained models exploit small vocabularies especially effectively
Pairing compact, efficient vocabularies with pretrained weights for multimodal forecasting