Small Vocabularies, Big Gains: Pretraining and Tokenization in Time Series Models

📅 2025-11-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how tokenizer design—specifically scaling and quantization strategies—and pretraining mechanisms affect the representational capacity of time-series foundation models. Method: Through systematic empirical evaluation and theoretical analysis, we decouple the roles of tokenization configuration and pretraining, and propose a lightweight, continuous-signal-aware tokenization paradigm suited to multimodal shared vocabularies. Contribution/Results: We find that tokenizer configuration predominantly governs model expressivity and stability, whereas pretraining mainly improves optimization efficiency; their synergy lets compact vocabularies (≤128) match the performance of large ones (e.g., ≥1024). The proposed paradigm significantly enhances cross-modal transferability. Empirical results demonstrate that principled tokenization amplifies pretraining gains, yielding reproducible design principles and theoretical foundations for discrete representation learning in time series.

📝 Abstract
Tokenization and transfer learning are two critical components in building state-of-the-art time series foundation models for forecasting. In this work, we systematically study the effect of tokenizer design, specifically scaling and quantization strategies, on model performance, alongside the impact of pretraining versus random initialization. We show that tokenizer configuration primarily governs the representational capacity and stability of the model, while transfer learning influences optimization efficiency and alignment. Using a combination of empirical training experiments and theoretical analyses, we demonstrate that pretrained models consistently leverage well-designed tokenizers more effectively, particularly at smaller vocabulary sizes. Conversely, misaligned tokenization can diminish or even invert the benefits of pretraining. These findings highlight the importance of careful tokenization in time series modeling and suggest that combining small, efficient vocabularies with pretrained weights is especially advantageous in multimodal forecasting settings, where the overall vocabulary must be shared across modalities. Our results provide concrete guidance for designing tokenizers and leveraging transfer learning in discrete representation learning for continuous signals.
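The scaling-plus-quantization tokenization the abstract describes can be illustrated with a minimal sketch. This is an assumed, hypothetical implementation (mean scaling followed by uniform binning over a clipped range), not the paper's exact method; `vocab_size`, `clip`, and the function names are illustrative choices:

```python
import numpy as np

def tokenize_series(x, vocab_size=128, clip=3.0):
    """Map a continuous series to discrete tokens via scaling + uniform quantization.

    Illustrative sketch: mean scaling and uniform bins are assumptions,
    not necessarily the strategies studied in the paper.
    """
    x = np.asarray(x, dtype=float)
    scale = np.mean(np.abs(x))          # mean scaling
    scale = scale if scale > 0 else 1.0  # guard against an all-zero series
    z = np.clip(x / scale, -clip, clip)
    # vocab_size uniform bins over [-clip, clip] -> token ids 0..vocab_size-1
    edges = np.linspace(-clip, clip, vocab_size + 1)
    return np.clip(np.digitize(z, edges) - 1, 0, vocab_size - 1), scale

def detokenize(tokens, scale, vocab_size=128, clip=3.0):
    """Approximately invert tokenization using bin centers."""
    edges = np.linspace(-clip, clip, vocab_size + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers[tokens] * scale
```

With `vocab_size=128` the per-value quantization error is bounded by half a bin width times the scale, which is why, per the paper's findings, a compact vocabulary can suffice when paired with a well-chosen scaling scheme.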
Problem

Research questions and friction points this paper is trying to address.

How tokenizer design (scaling and quantization strategies) affects time series model performance
How pretraining versus random initialization affects optimization efficiency and alignment
Whether small vocabularies can be efficient in multimodal forecasting, where the vocabulary is shared across modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic study of scaling and quantization strategies in tokenizer design
Evidence that pretrained models exploit small vocabularies especially effectively
Pairing compact, efficient vocabularies with pretrained weights for multimodal forecasting