🤖 AI Summary
Standard scaled dot-product attention lacks explicit modeling of continuous monotonic alignment, limiting performance on frame-synchronous tasks such as text-to-speech (TTS). To address this, we propose stochastic clock attention: it models source–target sequence alignment as the meeting probability of two learned non-negative stochastic clocks, and a path-integral derivation yields a closed-form, Gaussian-like scoring function that ensures causality, smoothness, and a near-diagonal preference. This mechanism intrinsically enforces continuous monotonic alignment without positional regularization, supports both normalized and unnormalized clock forms, and unifies parallel and autoregressive decoding. In TTS, it significantly improves alignment stability and robustness to global temporal-scaling variations, while matching or surpassing baseline models in speech quality.
📝 Abstract
We formulate an attention mechanism for continuous and ordered sequences that explicitly functions as an alignment model, the core of many sequence-to-sequence tasks. Standard scaled dot-product attention relies on positional encodings and masks but does not enforce continuity or monotonicity, which are crucial for frame-synchronous targets. We attach learned nonnegative *clocks* to source and target and model attention as the meeting probability of these clocks; a path-integral derivation yields a closed-form, Gaussian-like scoring rule with an intrinsic bias toward causal, smooth, near-diagonal alignments, without external positional regularizers. The framework supports two complementary regimes: normalized clocks for parallel decoding when a global length is available, and unnormalized clocks for autoregressive decoding; both are nearly parameter-free, drop-in replacements. In a Transformer text-to-speech testbed, this construction produces more stable alignments and improved robustness to global time-scaling while matching or improving accuracy over scaled dot-product baselines. We hypothesize applicability to other continuous targets, including video and temporal signal modeling.
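To make the mechanism concrete, here is a minimal sketch of what a clock-based attention score could look like. This is an illustrative assumption, not the paper's implementation: each position contributes a positive increment (here, a softplus of a per-position scalar standing in for the learned clock network), the increments are cumulatively summed into monotone clocks, clocks are rescaled to [0, 1] in the normalized regime, and scores are Gaussian in the clock difference, so mass concentrates where the two clocks "meet". The function name `clock_attention` and the fixed width `sigma` are hypothetical.

```python
import numpy as np

def softplus(x):
    # Nonnegative increment from an unconstrained scalar
    return np.log1p(np.exp(x))

def clock_attention(src_feat, tgt_feat, normalize=True, sigma=0.1):
    """Hypothetical sketch of clock-based attention weights.

    src_feat: (S,) per-position scalars for the source sequence
    tgt_feat: (T,) per-position scalars for the target sequence
    Returns a (T, S) row-stochastic attention matrix.
    """
    # Positive increments -> monotone (causal) clocks via cumulative sum
    src_clock = np.cumsum(softplus(src_feat))
    tgt_clock = np.cumsum(softplus(tgt_feat))
    if normalize:
        # Normalized regime: both clocks run from ~0 to 1,
        # usable for parallel decoding when total length is known
        src_clock = src_clock / src_clock[-1]
        tgt_clock = tgt_clock / tgt_clock[-1]
    # Gaussian-like scoring: large where the clocks nearly coincide,
    # giving a smooth, near-diagonal bias without positional encodings
    diff = tgt_clock[:, None] - src_clock[None, :]          # (T, S)
    scores = -0.5 * (diff / sigma) ** 2
    # Softmax over source positions
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

# With constant increments, the alignment is a clean diagonal
A = clock_attention(np.zeros(6), np.zeros(9))
```

Because the clocks are cumulative sums of positive increments, monotonicity and causality hold by construction; the `sigma` width controls how sharply attention concentrates around the diagonal.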