🤖 AI Summary
Existing time-series editing methods rely on predefined attribute vectors and iterative diffusion sampling, which limits the flexibility of the conditioning format and offers no control over editing intensity. This paper introduces the first natural-language-instruction-based time-series editing framework. It constructs a shared multimodal embedding space for time series and textual instructions: multi-resolution encoders capture both local and global edits, an instruction-conditioned decoder generates the edited series, and interpolation between embeddings provides continuous control over edit intensity. The method generalizes zero-shot to unseen instructions and adapts to unseen conditions through few-shot learning, all without hand-crafted attributes or iterative sampling. Evaluated on synthetic and real-world benchmarks, it achieves significant improvements in editing fidelity, semantic alignment, and controllability over prior diffusion-based editors.
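The "multi-resolution joint encoding" idea can be illustrated with a minimal sketch: encode the same series at several temporal resolutions (fine views preserve local structure, coarse views expose global trends) and concatenate the views into one embedding. This is only an illustrative stand-in, assuming simple average pooling in place of the paper's learned encoders; the function name and resolution choices are hypothetical.

```python
import numpy as np

def multires_encode(series, resolutions=(1, 4, 16)):
    """Illustrative multi-resolution encoding (not the paper's model):
    pool the series at each resolution and concatenate the views."""
    series = np.asarray(series, dtype=float)
    views = []
    for r in resolutions:
        n = (len(series) // r) * r          # trim to a multiple of r
        pooled = series[:n].reshape(-1, r).mean(axis=1)  # average-pool windows of size r
        views.append(pooled)
    return np.concatenate(views)

# A 64-step toy series yields 64 + 16 + 4 = 84 embedding dimensions.
emb = multires_encode(np.sin(np.linspace(0.0, 6.28, 64)))
```

In the actual framework each resolution would pass through a learned encoder before fusion, but the structural point is the same: local and global information coexist in a single joint representation.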
📝 Abstract
In time series editing, we aim to modify some properties of a given time series without altering others. For example, when analyzing a hospital patient's blood pressure, we may add a sudden early drop and observe how it impacts their future readings while preserving other conditions. Existing diffusion-based editors rely on rigid, predefined attribute vectors as conditions and produce all-or-nothing edits through sampling. This attribute- and sampling-based approach limits flexibility in the condition format and lacks customizable control over editing strength. To overcome these limitations, we introduce Instruction-based Time Series Editing, where users specify intended edits in natural language. This allows users to express a wider range of edits in a more accessible format. We then introduce InstructTime, the first instruction-based time series editor. InstructTime takes in time series and instructions, embeds them into a shared multi-modal representation space, then decodes their embeddings to generate edited time series. By learning a structured multi-modal representation space, we can easily interpolate between embeddings to achieve varying degrees of edit. To handle local and global edits together, we propose multi-resolution encoders. In our experiments on synthetic and real datasets, we find that InstructTime is a state-of-the-art time series editor: it achieves high-quality edits with controllable strength, generalizes to unseen instructions, and can be easily adapted to unseen conditions through few-shot learning.
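The claim that a structured embedding space allows "varying degrees of edit" amounts to blending the source embedding with the fully-edited one before decoding. A minimal sketch of that interpolation step, under the assumption of plain linear interpolation (the function and variable names are illustrative, not the paper's API):

```python
import numpy as np

def interpolate_edit(z_src, z_edit, alpha):
    """Blend source and fully-edited embeddings.

    alpha = 0.0 reproduces the source, alpha = 1.0 applies the full edit,
    and intermediate values give partial edit strength. The blended
    embedding would then be passed to the decoder to produce the series.
    """
    z_src, z_edit = np.asarray(z_src, float), np.asarray(z_edit, float)
    return (1.0 - alpha) * z_src + alpha * z_edit

# Toy embeddings: a half-strength edit lands midway between the two.
z_half = interpolate_edit(np.zeros(8), np.ones(8), alpha=0.5)
```

Whether linear paths through the learned space correspond to perceptually linear edit strength depends on how the representation space is structured during training, which is precisely what the framework's joint encoding is meant to ensure.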