AI Summary
Existing video translation methods exhibit significantly weaker local structural preservation and temporal consistency compared to image-based models, frequently suffering from motion jitter and structural distortion. This paper proposes a diffusion-based framework for temporally consistent human action video translation. Its core innovation is the first introduction of Query Warping: a mechanism that explicitly aligns cross-frame query tokens via appearance flow and constrains self-attention outputs during denoising to ensure motion coherence. The approach jointly preserves local geometric fidelity and global temporal stability. Evaluated on multiple human action video translation benchmarks, the method achieves new state-of-the-art performance, with substantial improvements in quantitative metrics (e.g., FVD, LPIPS) and qualitative visual quality. It effectively mitigates motion jitter and structural artifacts, demonstrating superior spatiotemporal consistency.
Abstract
In this paper, we present QueryWarp, a novel framework for temporally coherent human motion video translation. Existing diffusion-based video editing approaches rely solely on key and value tokens to ensure temporal consistency, which sacrifices the preservation of local and structural regions. In contrast, we aim to exploit complementary query priors by constructing temporal correlations among query tokens from different frames. Initially, we extract appearance flows from source poses to capture continuous human foreground motion. Subsequently, during the denoising process of the diffusion model, we employ appearance flows to warp the previous frame's query tokens, aligning them with the current frame's queries. This query warping imposes explicit constraints on the outputs of self-attention layers, effectively guaranteeing temporally coherent translation. We perform experiments on various human motion video translation tasks, and the results demonstrate that our QueryWarp framework surpasses state-of-the-art methods both qualitatively and quantitatively.
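The warping step described above can be sketched in a few lines. This is a minimal, hedged illustration, not the paper's implementation: the function names `warp_queries` and `fuse_queries`, the nearest-neighbor backward warp, the foreground mask, and the blending weight `alpha` are all assumptions made here for clarity. In the actual method the flow is predicted from source poses and the fusion happens inside the self-attention layers of a latent diffusion model.

```python
import numpy as np

def warp_queries(prev_q, flow):
    """Backward-warp the previous frame's query tokens along a per-token
    appearance flow (nearest-neighbor lookup for simplicity; the real
    method would use a differentiable, sub-pixel warp)."""
    h, w, _ = prev_q.shape
    warped = np.zeros_like(prev_q)
    for y in range(h):
        for x in range(w):
            # flow[y, x] points to where token (y, x) came from in the previous frame
            dx, dy = flow[y, x]
            sx = int(np.clip(x + dx, 0, w - 1))
            sy = int(np.clip(y + dy, 0, h - 1))
            warped[y, x] = prev_q[sy, sx]
    return warped

def fuse_queries(cur_q, warped_q, fg_mask, alpha=0.5):
    """Blend current queries with flow-aligned previous queries inside the
    human-foreground mask; background queries are left untouched.
    `alpha` is a hypothetical blending weight, not from the paper."""
    m = fg_mask[..., None].astype(cur_q.dtype)
    return m * (alpha * cur_q + (1.0 - alpha) * warped_q) + (1.0 - m) * cur_q
```

With a zero flow the warp is the identity, and with an empty foreground mask the current queries pass through unchanged, which makes the constraint easy to sanity-check in isolation.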