🤖 AI Summary
Standard scaled dot-product attention lacks explicit modeling of continuous monotonic alignment, limiting performance on frame-synchronous tasks such as text-to-speech (TTS). To address this, we propose stochastic clock attention: it models source–target sequence alignment as the meeting probability of two learned non-negative stochastic clocks, and a path-integral derivation yields a closed-form, Gaussian-like scoring function that ensures causality, smoothness, and a near-diagonal preference. This mechanism intrinsically enforces continuous monotonic alignment without positional regularization, supports both normalized and unnormalized clock forms, and unifies parallel and autoregressive decoding. In TTS, it significantly improves alignment stability and robustness to global temporal-scaling variations, while matching or surpassing baseline models in speech quality.
📝 Abstract
We formulate an attention mechanism for continuous and ordered sequences that explicitly functions as an alignment model, the core of many sequence-to-sequence tasks. Standard scaled dot-product attention relies on positional encodings and masks but does not enforce continuity or monotonicity, which are crucial for frame-synchronous targets. We attach learned nonnegative *clocks* to source and target and model attention as the meeting probability of these clocks; a path-integral derivation yields a closed-form, Gaussian-like scoring rule with an intrinsic bias toward causal, smooth, near-diagonal alignments, without external positional regularizers. The framework supports two complementary regimes: normalized clocks for parallel decoding when a global length is available, and unnormalized clocks for autoregressive decoding; both are nearly parameter-free, drop-in replacements. In a Transformer text-to-speech testbed, this construction produces more stable alignments and improved robustness to global time-scaling while matching or improving accuracy over scaled dot-product baselines. We hypothesize applicability to other continuous targets, including video and temporal signal modeling.
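To make the mechanism concrete, here is a minimal sketch of what a clock-based attention score could look like. This is an illustrative assumption, not the paper's implementation: each position contributes a positive increment (here, a softplus of a per-position scalar standing in for the learned clock network), the increments are cumulatively summed into monotone clocks, clocks are rescaled to [0, 1] in the normalized regime, and scores are Gaussian in the clock difference, so mass concentrates where the two clocks "meet". The function name `clock_attention` and the fixed width `sigma` are hypothetical.

```python
import numpy as np

def softplus(x):
    # Nonnegative increment from an unconstrained scalar
    return np.log1p(np.exp(x))

def clock_attention(src_feat, tgt_feat, normalize=True, sigma=0.1):
    """Hypothetical sketch of clock-based attention weights.

    src_feat: (S,) per-position scalars for the source sequence
    tgt_feat: (T,) per-position scalars for the target sequence
    Returns a (T, S) row-stochastic attention matrix.
    """
    # Positive increments -> monotone (causal) clocks via cumulative sum
    src_clock = np.cumsum(softplus(src_feat))
    tgt_clock = np.cumsum(softplus(tgt_feat))
    if normalize:
        # Normalized regime: both clocks run from ~0 to 1,
        # usable for parallel decoding when total length is known
        src_clock = src_clock / src_clock[-1]
        tgt_clock = tgt_clock / tgt_clock[-1]
    # Gaussian-like scoring: large where the clocks nearly coincide,
    # giving a smooth, near-diagonal bias without positional encodings
    diff = tgt_clock[:, None] - src_clock[None, :]          # (T, S)
    scores = -0.5 * (diff / sigma) ** 2
    # Softmax over source positions
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

# With constant increments, the alignment is a clean diagonal
A = clock_attention(np.zeros(6), np.zeros(9))
```

Because the clocks are cumulative sums of positive increments, monotonicity and causality hold by construction; the `sigma` width controls how sharply attention concentrates around the diagonal.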