TokenDial: Continuous Attribute Control in Text-to-Video via Spatiotemporal Token Offsets

📅 2026-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-video generation models struggle to provide fine-grained, continuous control over attribute variations, such as effect intensity or motion magnitude, while preserving identity, background, and temporal consistency. This work proposes an approach that requires no fine-tuning of the backbone model: by introducing adjustable additive offsets in the spatiotemporal visual patch-token latent space of a pretrained model, it enables slider-based continuous editing. The method formulates semantic control as directional token shifts, combining semantic direction alignment with motion-magnitude scaling to disentangle appearance from dynamic attributes. Experiments show that the approach consistently outperforms state-of-the-art methods across diverse attributes and prompts, with superior controllability and generation quality confirmed by both quantitative metrics and human evaluations.
📝 Abstract
We present TokenDial, a framework for continuous, slider-style attribute control in pretrained text-to-video generation models. While modern generators produce strong holistic videos, they offer limited control over how much an attribute changes (e.g., effect intensity or motion magnitude) without drifting identity, background, or temporal coherence. TokenDial is built on the observation that additive offsets in the intermediate spatiotemporal visual patch-token space form a semantic control direction, where adjusting the offset magnitude yields coherent, predictable edits for both appearance and motion dynamics. We learn attribute-specific token offsets without retraining the backbone, using pretrained understanding signals: semantic direction matching for appearance and motion-magnitude scaling for motion. We demonstrate TokenDial's effectiveness on diverse attributes and prompts, achieving stronger controllability and higher-quality edits than state-of-the-art baselines, supported by extensive quantitative evaluation and human studies.
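The core mechanism the abstract describes, an additive, slider-scaled offset applied to spatiotemporal patch tokens, can be sketched minimally in NumPy. This is an illustrative toy, not the paper's implementation: the function name, tensor shapes, and the random "direction" stand in for an attribute-specific offset that TokenDial would learn from pretrained understanding signals.

```python
import numpy as np

def apply_token_offset(tokens, direction, alpha):
    """Shift patch tokens along a semantic direction by slider value alpha.

    tokens:    (T, N, D) spatiotemporal visual patch tokens
               (T frames, N patches per frame, D-dim tokens)
    direction: (D,) attribute direction (hypothetical; learned in the paper,
               random here for illustration)
    alpha:     scalar slider controlling edit strength
    """
    unit = direction / np.linalg.norm(direction)  # normalize the direction
    return tokens + alpha * unit                  # broadcast over T and N

# Toy demo: the edit grows linearly and predictably with the slider value.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 16, 8))   # 4 frames, 16 patches, 8-dim tokens
direction = rng.normal(size=8)
unit = direction / np.linalg.norm(direction)

for alpha in (0.0, 0.5, 1.0):
    edited = apply_token_offset(tokens, direction, alpha)
    # Mean projection of the change onto the direction equals alpha exactly.
    proj = ((edited - tokens) @ unit).mean()
    print(round(float(proj), 3))
```

Running this prints `0.0`, `0.5`, `1.0`: the displacement in token space scales linearly with the slider, which is the property that makes offset magnitude a continuous, predictable control knob.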
Problem

Research questions and friction points this paper is trying to address.

attribute control
text-to-video generation
continuous control
temporal coherence
spatiotemporal tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

token offset
continuous attribute control
text-to-video generation
spatiotemporal editing
semantic direction